A really neat by-product of corpus analysis, and especially of word frequency analysis, is that you end up with a list of the most used words, including where and how many times they appear in the text. And that's how wordclouds are born :-)
But instead of using a web tool such as Wordle, you can quite easily generate them yourself with R and the wordcloud package.
Preparing the text
It gets especially easy when you use the text mining package tm, which does the whole corpus analysis for you. All you have to do is specify a directory with the plain input text(s) and the "cleaning up" you would like to apply, for example:
- Remove punctuation marks:
mycorpus <- tm_map(mycorpus, removePunctuation)
- Convert corpus to lower case:
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
- Remove stop words like prepositions, conjunctions, pronouns, etc. which appear in every text and do not really contribute to the meaning of a text:
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("english")))
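The cleaning steps above can be tried on a tiny in-memory corpus before pointing tm at a whole directory. This is only a minimal sketch: the two sample sentences are made up, VCorpus(VectorSource(...)) stands in for the directory source, stripWhitespace is an extra step not listed above, and tolower is wrapped in content_transformer() on the assumption that a reasonably recent tm version (0.6 or later) is installed.

```r
library(tm)

# Two made-up sample documents (hypothetical, for illustration only)
texts <- c("The quick brown fox, jumping over the lazy dog!",
           "A dog and a fox: the best of friends?")

# Build a corpus from the character vector instead of a directory
mycorpus <- VCorpus(VectorSource(texts))

# Same cleaning steps as in the list above
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))

# Extra step (not in the list above): collapse the gaps left by removed words
mycorpus <- tm_map(mycorpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- sapply(mycorpus, as.character)
print(cleaned)
```

Printing the cleaned text after each step is a quick way to check that a stop word list or a transformation does what you expect.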
Here's an example with frequently used words from my blog posts:
You can also change the color palette you would like to use (here are some nice examples). Here's an example from a Swiss news site, to see what's trending :-) Because of the smaller "corpus" I reduced the number of words to be drawn to 60; you can specify that in the last line with max.words.
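If you would rather not pick hex codes by hand, the RColorBrewer package can generate a palette for you. A small sketch, assuming RColorBrewer is installed; the palette names "Dark2" and "Blues" are just examples I picked, not the ones used for the images here.

```r
library(RColorBrewer)

# Eight colors from the qualitative "Dark2" palette
pal <- brewer.pal(8, "Dark2")
print(pal)

# Sequential palettes start with very light colors that are hard to read
# on a white background, so keep only the darker end
blues <- brewer.pal(9, "Blues")[5:9]
print(blues)
```

Either vector can then be handed to wordcloud() through its colors argument.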
R code (requires the tm, wordcloud and RColorBrewer packages; you can get them at your local CRAN site):
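library(wordcloud)
library(tm)

# custom color palette
oc <- c("#5C6E00", "#273B00", "#F7DA00", "#EB1200", "#F78200")

# read all plain text files from the "corpus" directory
t <- Corpus(DirSource(directory = "corpus", encoding = "utf-8"), readerControl = list(language = "ger"))

# clean up: punctuation, lower case, German stop words
mycorpus <- tm_map(t, removePunctuation)
# content_transformer() keeps the documents intact in newer tm versions,
# so the extra conversion with PlainTextDocument is no longer needed
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("german")))

# count how often each term appears across all documents
tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# draw the cloud; max.words limits how many words are drawn
wordcloud(d$word, d$freq, scale = c(6, .5), min.freq = 2, max.words = 100, random.order = FALSE, random.color = TRUE, rot.per = .20, colors = oc)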
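The step from the term-document matrix to the word/frequency table is easier to follow on a toy matrix. The sketch below uses base R only; the terms and counts are invented for illustration and do not come from any real corpus.

```r
# Toy term-document matrix: rows are terms, columns are documents
# (made-up counts, for illustration only)
m <- matrix(c(3, 0, 1,
              1, 1, 0,
              5, 2, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("cloud", "corpus", "word"),
                            c("doc1", "doc2", "doc3")))

# Sum each term over all documents and sort, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# Peek at the most frequent terms before drawing anything
print(head(d, 10))

# The terms that min.freq = 2 would keep
print(d[d$freq >= 2, ])
```

Looking at head(d) before calling wordcloud() also helps when choosing sensible values for min.freq and max.words.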