29. December 2014

A really neat by-product of corpus analysis, and especially of word frequency analysis, is that you end up with a list of the most used words, including where and how many times they appear in the text. And that’s how word clouds are born :-)

But instead of using a web tool such as Wordle, you can quite easily generate them yourself with R and the wordcloud package.

Preparing the text

It gets especially easy when you use the text mining package tm, which does the whole corpus analysis for you. All you have to do is specify a directory with the plain input text(s) and the “cleaning up” you would like to apply, for example:

  • Remove punctuation marks:
    mycorpus <- tm_map(mycorpus, removePunctuation)
  • Convert the corpus to lower case:
    mycorpus <- tm_map(mycorpus, content_transformer(tolower))
  • Remove stop words such as prepositions, conjunctions and pronouns, which appear in every text and contribute little to its meaning:
    mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
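
To see what these clean-up steps actually do, here is a minimal sketch that runs them on a tiny in-memory corpus instead of a directory. The two sample sentences are made up for illustration, and tolower is wrapped in content_transformer(), which is what recent versions of tm expect for plain R functions (an assumption about your installed tm version).

```r
library(tm)

# Made-up sample sentences, used only for illustration
texts <- c("The quick brown fox, jumping over the lazy dog!",
           "A dog and a fox: what is the meaning of this?")

# Build a corpus from a character vector instead of a directory
mycorpus <- VCorpus(VectorSource(texts))

# The three clean-up steps from the list above
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))

# Pull the cleaned text back out as a plain character vector
cleaned <- sapply(mycorpus, as.character)
print(cleaned)
```

Removed stop words leave extra spaces behind; that does not matter for counting words later on.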


Here’s an example with often used words from my blog posts:

You can also change the color palette (here are some nice examples). Here’s an example from a Swiss news site, to see what’s trending :-) Because of the smaller “corpus” I reduced the number of words drawn to 60; you can set that with the max.words argument of the wordcloud call at the end of the code.
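
As a rough sketch of those two tweaks, the snippet below swaps in a ready-made RColorBrewer palette and caps the cloud at 60 words. The word frequencies are made up for illustration, and "Dark2" is just one palette I picked; any other brewer.pal palette should work the same way.

```r
library(wordcloud)
library(RColorBrewer)

# Made-up word frequencies, only for illustration
d <- data.frame(
  word = c("analysis", "corpus", "wordcloud", "frequency", "text", "package"),
  freq = c(12, 9, 7, 5, 4, 2)
)

# Eight colors from the "Dark2" palette that ships with RColorBrewer
pal <- brewer.pal(8, "Dark2")

# Word placement is random, so fix the seed for a repeatable layout
set.seed(42)

# max.words caps how many words are drawn (60 here)
wordcloud(d$word, d$freq, scale = c(4, .5), min.freq = 2, max.words = 60,
  random.order = FALSE, colors = pal)
```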

R code (requires the tm, wordcloud and RColorBrewer packages, which you can get from your local CRAN mirror):


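library(tm)
library(wordcloud)
library(RColorBrewer)

# custom color palette (a brewer.pal() palette works here as well)
oc <- c("#5C6E00", "#273B00", "#F7DA00", "#EB1200", "#F78200")

# read all plain text files from the directory "corpus"
docs <- Corpus(DirSource(directory = "corpus", encoding = "UTF-8"),
  readerControl = list(language = "ger"))

# clean up: punctuation, case, German stop words
mycorpus <- tm_map(docs, removePunctuation)
# tolower is a plain R function, so wrap it to keep the documents intact
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removeWords, stopwords("german"))
tdm <- TermDocumentMatrix(mycorpus)

# sum each word over all documents, most frequent first
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# draw at most 100 words that appear at least twice
wordcloud(d$word, d$freq, scale = c(6, .5), min.freq = 2, max.words = 100,
  random.order = FALSE, random.color = TRUE, rot.per = .20, colors = oc)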
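
If the step between the term-document matrix and the word cloud looks a bit magic, here is a small base R sketch of it. The matrix is made up and stands in for as.matrix(tdm): rows are words, columns are documents, and each cell is how often the word appears in that document.

```r
# Made-up term-document counts: rows are words, columns are documents
m <- matrix(c(3, 0, 1,
              1, 2, 0,
              0, 5, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("corpus", "text", "word"),
                            c("doc1", "doc2", "doc3")))

# Sum each word over all documents and sort, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)

# Same word/frequency table that is passed to wordcloud()
d <- data.frame(word = names(v), freq = v)
print(d)
```

Here "word" ends up first with a total of 7, followed by "corpus" (4) and "text" (3), and that order is what decides which words are drawn largest.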
