Welcome!

Welcome to the personal website of Adrianus Kleemans.

You can navigate through older stuff on the left in the archive, or check out the most recent posts below.

---

Boxplots

73 days ago

I took a statistics course last fall, and there is a lot of code involved in calculating all the different parameters.
“Unfortunately” all of this is in R – which is undisputedly one of the best tools for statistics in general – but I don’t know it well enough to get good results within a few minutes. So I started using Python with matplotlib.

One example is boxplots, which are great for an overview of five important parameters: the median, the minimum and maximum (if in range), and the 50% box (the interquartile range, IQR).
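Just to make these five parameters concrete, here's how you could compute them yourself – a minimal sketch with made-up values, not part of the original script (requires numpy):

import numpy

values = [18.5, 19.2, 20.1, 21.4, 22.0, 22.8, 23.5, 24.1, 26.2]  # made-up BMI values

q1 = numpy.percentile(values, 25)      # lower edge of the box
median = numpy.percentile(values, 50)  # the red line in the plots below
q3 = numpy.percentile(values, 75)      # upper edge of the box
print 'min/max:', min(values), max(values)
print 'quartiles:', q1, median, q3
print 'IQR (height of the box):', q3 - q1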

Data

The data is from our professor: he provided the weight and height of students from some years ago, 250 students in total.

After calculating the BMI and entering the data, a first boxplot, which can be generated automatically with boxplot(data), looks as follows:
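For reference, the unstyled default is just a few lines. Here's a minimal, self-contained sketch with placeholder values (the real data lists are in the full script at the end of the post):

import matplotlib.pyplot as plt

data = [[20.1, 21.3, 22.5, 23.0, 25.9], [19.0, 20.5, 21.2, 22.8, 24.8]]  # placeholder values

fig = plt.figure()
ax = fig.add_subplot(111)
bp = ax.boxplot(data)  # bp is a dict of the drawn elements ('boxes', 'medians', ...)
plt.show()

The bp dict returned by boxplot() is what we'll use later to recolor the individual elements.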

So, next up is some fine-tuning to make it look better.

Axes

First, I wanted to remove some black lines from the axes, to put the focus on the boxplots themselves. I found some good suggestions here.

I started by removing the unnecessary “spines”:

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)

Then the ticks:

ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')

Then, for better readability, I added some horizontal lines as part of the background grid:

ax.yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)

Text

For better readability at first glance, I added some explanation to the title and the axes:

ax.set_title('BMI-Vergleich von Studierenden')
ax.set_xlabel('Geschlecht')
ax.set_ylabel('BMI')
pylab.xticks([1, 2], ['m', 'w'])

Color

But the color also wasn’t what I had in mind – the default blue is quite aggressive. So I applied an indigo tone to all the elements except for the median. This can be done by setting the parameters for each element class separately:

blue = '#0D4F8B' #indigo
pylab.plt.setp(bp['boxes'], color=blue)
pylab.plt.setp(bp['medians'], color='red')
pylab.plt.setp(bp['whiskers'], color=blue)
pylab.plt.setp(bp['fliers'], color=blue)
pylab.plt.setp(bp['caps'], color=blue)

Also, so that the figure shown on screen (besides the one that is saved) isn’t presented in a grey box, you can pass facecolor="white" when creating the figure.
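In the script, that's simply the facecolor argument on the figure call:

fig = pylab.plt.figure(1, figsize=(9, 6), facecolor="white")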

So here’s the final Boxplot (click for bigger picture):

Python Code (Python 2.7, matplotlib required)

'''Plots some boxplots about student BMI data.'''
__author__ = 'Adrianus Kleemans'
__date__ = '09.12.2014'
import pylab
# BMI data StatWiSo2003 (m, f)
data = [[17.9163, 18.0055, 18.3129, 18.5185, 18.6214, 18.7925, 18.8739, 19.0194, 19.1598, 19.3213, 19.3906, 19.4444, 19.4943, 19.5682, 19.5918, 19.6232, 19.6623, 19.6623, 19.8177, 19.8352, 19.9251, 20.0155, 20.0177, 20.0474, 20.0617, 20.0617, 20.0692, 20.227, 20.2336, 20.2865, 20.3052, 20.3052, 20.3052, 20.3052, 20.3052, 20.5151, 20.5486, 20.6038, 20.6193, 20.7476, 20.8308, 20.9024, 20.9024, 20.9573, 20.984, 20.9961, 21.0498, 21.0498, 21.1327, 21.1927, 21.2228, 21.2245, 21.276, 21.3669, 21.4476, 21.4476, 21.4619, 21.4692, 21.6049, 21.6049, 21.6049, 21.6333, 21.7079, 21.7181, 21.7982, 21.7994, 21.8776, 21.9138, 21.9292, 21.9525, 21.9671, 22.0932, 22.1526, 22.2041, 22.2041, 22.2291, 22.2439, 22.3055, 22.3055, 22.3055, 22.3403, 22.3435, 22.3954, 22.4082, 22.4088, 22.4088, 22.46, 22.4913, 22.4913, 22.4982, 22.5152, 22.5306, 22.5981, 22.5981, 22.5981, 22.6347, 22.6347, 22.6422, 22.7244, 22.7244, 22.7244, 22.7244, 22.7244, 22.7732, 22.8395, 22.8571, 22.9481, 22.9819, 22.9917, 23.0954, 23.1206, 23.1481, 23.1481, 23.1837, 23.2005, 23.2438, 23.2438, 23.2912, 23.4509, 23.5036, 23.5294, 23.5294, 23.6295, 24.0569, 24.0569, 24.0569, 24.093, 24.179, 24.2215, 24.2587, 24.4418, 24.4857, 24.6181, 24.858, 25.3086, 25.7276, 25.9924, 26.2346, 26.5118, 30.7564, 31.2394],
        [15.7558, 16.955, 17.8565, 17.9931, 17.9931, 17.9982, 18.0427, 18.132, 18.1449, 18.2183, 18.3391, 18.3391, 18.3655, 18.3768, 18.3768, 18.4013, 18.424, 18.424, 18.424, 18.5185, 18.5911, 18.5911, 18.5911, 18.662, 18.6851, 18.7109, 18.7305, 18.7328, 18.7328, 18.75, 18.75, 18.7783, 18.8707, 18.9388, 19.1001, 19.1406, 19.2336, 19.2894, 19.3213, 19.3337, 19.3625, 19.3698, 19.3792, 19.4674, 19.5312, 19.5312, 19.5717, 19.7232, 19.7531, 19.8352, 19.8839, 19.9219, 19.9481, 20.0617, 20.0692, 20.0773, 20.1951, 20.1956, 20.1956, 20.2449, 20.3816, 20.4382, 20.5191, 20.5499, 20.5693, 20.7008, 20.7031, 20.7612, 20.7612, 21.1073, 21.1389, 21.2585, 21.2585, 21.4109, 21.4533, 21.4692, 21.4844, 21.7181, 21.875, 21.9138, 21.9671, 22.1003, 22.1453, 22.2041, 22.3081, 22.3863, 22.4088, 22.5827, 22.5981, 22.6562, 22.6757, 22.68, 22.7732, 23.1084, 23.1405, 23.2335, 23.3844, 23.939, 24.0346, 24.093, 24.2215, 24.4352, 24.5829, 24.6094, 24.6755, 24.677, 25.0593, 25.6173, 25.9516]]

# create a figure instance
fig = pylab.plt.figure(1, figsize=(9, 6), facecolor="white")
ax = fig.add_subplot(111)
bp = ax.boxplot(data)

# remove axes and ticks
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')

# some helping lines
ax.yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)

# title and axis labels
ax.set_title('BMI-Vergleich von Studierenden')
ax.set_xlabel('Geschlecht')
ax.set_ylabel('BMI')
pylab.xticks([1, 2], ['m', 'w'])

# color boxplots
blue = '#0D4F8B'  # indigo
pylab.plt.setp(bp['boxes'], color=blue)
pylab.plt.setp(bp['medians'], color='red')
pylab.plt.setp(bp['whiskers'], color=blue)
pylab.plt.setp(bp['fliers'], color=blue)
pylab.plt.setp(bp['caps'], color=blue)

fig.savefig('boxplot.png', bbox_inches='tight')
pylab.show()

---

Searching bookmarks

96 days ago

Ever had the problem that you remember that site, that blog post, but you have no idea when or where it was? If you share my habit of just bookmarking potentially interesting pages, you've surely come across the same problem I had last week: knowing that somewhere in your bookmarks lies the website with the exact thing you're searching for, but being unable to find it.

And maybe you vaguely remember that back then it took you hours of googling, and you're not willing to do that all over again. That feels extremely frustrating – especially if, like me, you're a chaotic person with zero order in your bookmarks (who's got time for that, seriously? :)).

So I wrote a Python script which downloads all my bookmarked pages and makes them full-text searchable. It works as follows:

  • Step 1: Export your bookmarks as an HTML file and save it as bookmarks.html (the script currently works with Chrome bookmark files; check here for how to export your bookmarks)
  • Step 2: Run the script and let it download all your bookmarked pages (this could take a while)
  • Step 3: Once downloaded, the information will be pickled (saved) to a file, so on the next start you don’t have to download all the pages again. Neat!
  • Step 4: Enter search terms and all the pages containing them will be returned.

In my case, downloading 1300+ pages does indeed take a while.

After the download finishes, it generates a ~100 MB pickle:

This pickle will then be used when the program is restarted.
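The caching pattern itself is tiny; here's a minimal sketch of the load-or-build logic (with a placeholder entry instead of the real download step):

import os.path
import pickle

library_pickle = 'bookmarks.p'
if not os.path.exists(library_pickle):
    # first run: build the {url: page text} dict and save it
    library = {'http://example.com': 'example page text'}  # placeholder
    pickle.dump(library, open(library_pickle, 'wb'))

# every later run just loads the cached library
library = pickle.load(open(library_pickle, 'rb'))
print 'Loaded', len(library), 'bookmarks.'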

Searching some terms:

Found existing library, reading.
Enter search terms (quit with q): python threads
http://www.brunningonline.net/simon/python/quick-ref2_0.html
http://pypy.org/
http://www.doughellmann.com/PyMOTW/multiprocessing/basics.html
http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies
http://io9.com/5975592/aaron-swartz-died-innocent-++-here-is-the-evidence
http://morepypy.blogspot.de/2012/08/multicore-programming-in-pypy-and.html
http://docs.python.org/2/library/multiprocessing.html
Enter search terms (quit with q): python processes
http://wiki.python.org/moin/PersistenceTools
http://python3porting.com/improving.html
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
http://news.ycombinator.com/item?id=3120380
http://www.doughellmann.com/PyMOTW/multiprocessing/basics.html
http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies
http://morepypy.blogspot.de/2012/08/multicore-programming-in-pypy-and.html
http://docs.python.org/2/library/multiprocessing.html
Enter search terms (quit with q): multiprocessing
http://python3porting.com/improving.html
http://www.doughellmann.com/PyMOTW/multiprocessing/basics.html
http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies
http://morepypy.blogspot.de/2012/08/multicore-programming-in-pypy-and.html
http://docs.python.org/2/library/multiprocessing.html

Code

Python code (Python 2.7, you’ll need html2text):

'''Downloads your bookmarked pages and makes them full-text searchable.'''
__author__ = "Adrianus Kleemans"
__date__ = "December 2014"
import os
import os.path
import pickle
from urllib2 import *
import html2text
import HTMLParser

def download(url):
    f = urlopen(url, timeout=5)
    data = f.read()
    f.close()
    charset = f.headers.getparam('charset')
    if charset is not None:
        try:
            udata = data.decode(charset)
            data = udata.encode('ascii', 'ignore')
        except (UnicodeDecodeError, LookupError):
            print 'Error decoding with charset=', charset
    return data

def download_library(library_pickle):
    library = {}
    f = open('bookmarks.html', 'r')
    bookmark_file = f.read()
    f.close()

    bookmarks = bookmark_file.split('A HREF="')
    del bookmarks[0]
    for i in range(len(bookmarks)):
        bookmarks[i] = bookmarks[i].split('"')[0]
    print 'Found', len(bookmarks), 'bookmarks.'

    # download bookmarked pages
    for bookmark in bookmarks:
        if bookmark not in library:
            print 'Downloading', bookmark[:50], '...'
            txt = ''
            try:
                txt = download(bookmark)
                txt = html2text.html2text(txt)
            except Exception, e:
                print 'Error:', e
            library[bookmark] = txt

    pickle.dump(library, open(library_pickle, 'wb'))

def main():
    library_pickle = 'bookmarks.p'
    if os.path.exists(library_pickle):
        print 'Found existing library, reading.'
    else:
        print 'No library found. Downloading, this could take a while...'
        download_library(library_pickle)
    library = pickle.load(open(library_pickle, 'rb'))

    # main search loop
    while True:
        keywords = raw_input('\nEnter search terms (quit with q): ')
        keywords = keywords.split(' ')
        if keywords == ['q']:
            break
        for bookmark in list(library):
            is_candidate = True
            for keyword in keywords:
                if keyword not in library[bookmark] and keyword not in bookmark:
                    is_candidate = False
            if is_candidate:
                print bookmark
    print 'Finished.'

if __name__ == "__main__":
    main()

---

Bayeux Meme

109 days ago

Originating from the 68-meter-long Bayeux Tapestry and the Historic Tale Construction Kit built on top of it, a wealth of memes has emerged from this huge pile of material. Rumor even has it that these were the very first German memes ever.

Guess the meme! :-)


---

Wordclouds

117 days ago

A really neat by-product of corpus analysis, and especially of word frequency analysis, is that you get a list of the most-used words, including where and how many times they appear in the text. And that's how wordclouds are born :-)
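The counting itself is nothing magical – in Python, for example, a frequency list is a two-liner with collections.Counter (just to illustrate the idea; the post below does all of this in R via the tm package):

from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"  # placeholder text
frequencies = Counter(text.split())
print frequencies.most_common(3)  # e.g. [('the', 3), ('quick', 1), ('brown', 1)]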

But instead of using a web tool such as Wordle, you can quite easily generate them yourself with R and the wordcloud package.

Preparing the text

It gets especially easy when you can use the text mining (tm) package, which does the whole corpus analysis for you; all you have to do is specify a directory with the plain input text(s) and some “cleaning up” you would like to apply, for example:

  • Remove punctuation marks:
    mycorpus <- tm_map(t, removePunctuation)
  • Convert the corpus to lower case:
    mycorpus <- tm_map(mycorpus, tolower)
  • Remove stop words like prepositions, conjunctions, pronouns, etc., which appear in every text and do not really contribute to its meaning:
    mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("english")))

Plotting

Here’s an example with frequently used words from my blog posts:

You can also change the color palette you would like to use (here are some nice examples). Here's an example from a Swiss news site, to see what's trending :-) Because of the smaller “corpus” I reduced the number of words to be drawn to 60; you can specify that in the last line.

R code (requires the tm, wordcloud and RColorBrewer packages, available from your local CRAN mirror):

library(wordcloud)
library(tm)
oc <- c("#5C6E00", "#273B00", "#F7DA00", "#EB1200", "#F78200") # color scheme
t <- Corpus(DirSource(directory = "corpus", encoding="utf-8"), readerControl = list(language = "ger"))
mycorpus <- tm_map(t, removePunctuation)
mycorpus <- tm_map(mycorpus, tolower)
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("german")))
mycorpus <- tm_map(mycorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq, scale=c(6,.5), min.freq=2, max.words=100, random.order=F, random.color=T, rot.per=.20, colors=oc)

---

Comparing images

134 days ago

Some months back I stumbled across this revealing blog post by Silviu Tantos of iconfinder.com. It's about how to compare two images and quantify the difference as a single number, showing how much one image looks like another.

For example: how much do these two images – a surprised koala, and the same image only with a fancy red hat – really look like one another? 80%? 90%? More?

And how would you calculate such a difference without having to iterate over all the pixels?

I remember being intrigued back when Google released its ‘search by image’ feature. How was it possible to determine (with such accuracy!) whether two images “look” like each other, or even to search for them?

The simple idea behind the whole algorithm described in the post is really fascinating, and learning it felt like being handed a well-kept secret :-) So I'd like to share some aspects of it, accompanied by a simple Python script.

Scaling and removing colors

First of all, we want to compare all sorts of images, and therefore all kinds of sizes. So we have to scale them down to a common size. If we choose a small size, the calculation will be easier later on – for example 9×8 pixels (one extra pixel per row, so that the dhash function works, see below).
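In PIL this is a one-liner; a minimal sketch (the filename koala.jpg is just an example):

from PIL import Image

# 9x8 pixels: one extra column, so each of the 8 pixels per row has a right neighbor
small = Image.open('koala.jpg').convert('L').resize((9, 8), Image.ANTIALIAS)
print list(small.getdata())[:9]  # the 9 intensity values of the first row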

Our two koalas will look as follows (enlarged for better visibility):

And after grayscaling, only the “intensity values” remain: each R-G-B triplet is reduced to one simple value. Technically, this step is done first, but to illustrate its effect, here it is now :-):

Not much of a difference now, eh?

Hashing: dhash

Now we have some scaled-down pictures – but we need to somehow transform this image data into numeric values suitable for fast comparison. Generating hashes comes to mind.

So, which hashing algorithm to choose? dhash, an algorithm which compares neighboring pixels to see whether they get darker or lighter, seems to be very accurate and also fast.

To apply it, every pixel (represented as an “intensity value”) of our shrunken, grayscaled image is compared to its right neighbor, so that a kind of gradient is established:

  • If the right neighbor is lighter (or the same), write a 0
  • If the right neighbor is darker, write a 1

You can see that in the first row, the image first lightens up (0), then stays the same (0), and the fourth pixel in the row is darker than the third => 1.
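As a sketch, this comparison rule is just a nested loop over the 8×8 grid – essentially what the dhash function in the full script below does (again with koala.jpg as an example filename):

from PIL import Image

small = Image.open('koala.jpg').convert('L').resize((9, 8), Image.ANTIALIAS)

bits = []
for row in range(8):
    for col in range(8):
        # write a 1 if the right neighbor is darker, else a 0
        bits.append(int(small.getpixel((col, row)) > small.getpixel((col + 1, row))))
print bits[:8]  # first row of the difference map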

After this, we end up with 8×8 values (for a hash length of 8) from which we can build a hex hash. For the two koalas we get the following hashes:

= 2e75c5a3c7cd4d4e
= 2e67c5a3c7cd4d4e

You can see straight away that they look nearly the same, and for an algorithm it is easy to compare bitwise how much of a difference there really is:

0010111001110101110001011010001111000111110011010100110101001110
0010111001100111110001011010001111000111110011010100110101001110
___________x__x_________________________________________________
= 2 bits difference

So the difference of the two hashes is 2 out of 64 bits in total, which makes the second image 96.875% similar to the first one from this point of view :-)
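That bit-counting is exactly what the diff function in the script below does; here is the same computation for the two koala hashes:

h1 = '2e75c5a3c7cd4d4e'
h2 = '2e67c5a3c7cd4d4e'

# XOR per hex digit, then count the set bits
bits = sum(bin(int(a, 16) ^ int(b, 16)).count('1') for a, b in zip(h1, h2))
print bits, 'bits difference'  # 2
print 'similarity:', 100.0 * (64 - bits) / 64, '%'  # 96.875 %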

Compared to the blog post mentioned above, I changed the dhash function so that it emits the bits exactly in the positions given by the image, which also simplifies the code.

00101110 | 2e
01110101 | 75
11000101 | c5
...

For example, as in the picture above, the difference map starts with 0010 1110, which is “2e” in hex, so the hex strings are an adequate representation which can be reproduced immediately by looking at the image. On the bitwise level this doesn't change anything, as long as all hex strings are generated the same way.
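The row-by-row mapping is ordinary binary-to-hex conversion:

print hex(int('00101110', 2))  # 0x2e -> first row of the hash
print hex(int('01110101', 2))  # 0x75 -> second row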

Comparing koalas

As a sample, I compared some other koala images. Three are modifications of the original image – one with an additional light source in the upper left corner, one repainted as dots, and one skewed version – plus one reference image with a completely different koala.

[Images: koala_light, koala_dots, koala_skew, other_koala]

They scored as follows, relative to the “original” koala image from the beginning (a value of 0 means no difference at all):

We see that the algorithm is robust against scaling and even against replotting with other techniques, and that the hash function is affected most by lighting changes, for example re-highlighting the image with another light source. But even in this case the difference is reasonably small (5 bits).

Code

The code checks for all .jpg images in the current working directory, picks the first one (in alphabetical order) and compares it to the others. Hashes are calculated with dhash, differences between hashes with diff, and then a horizontal bar plot is drawn using matplotlib.

Python Code (Python 2.7, using PIL and matplotlib/pylab and some koala images :-)

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''Image comparison script with the help of PIL.'''
__author__  = "Adrianus Kleemans"
__date__    = "30.11.2014"
import os
from PIL import Image
import pylab

def diff(h1, h2):
    # XOR the hashes hex digit by hex digit and count the differing bits
    return sum([bin(int(a, 16) ^ int(b, 16)).count('1') for a, b in zip(h1, h2)])

def dhash(image, hash_size=8):
    # scaling and grayscaling
    image = image.convert('L').resize((hash_size + 1, hash_size), Image.ANTIALIAS)

    # calculate differences: True if the right neighbor is darker
    diff_map = []
    for row in range(hash_size):
        for col in range(hash_size):
            diff_map.append(image.getpixel((col, row)) > image.getpixel((col + 1, row)))

    # build a zero-padded hex string of fixed width, so diff() can zip two hashes safely
    value = sum(2**i * b for i, b in enumerate(reversed(diff_map)))
    return '%0*x' % (hash_size * hash_size // 4, value)

def main():
    # detect all pictures (alphabetical order, so the first one is the reference)
    pictures = sorted(f for f in os.listdir(".") if f.endswith('.jpg'))

    # compare with first picture
    image1 = Image.open(pictures[0])
    h1 = dhash(image1)
    print 'Checking picture', pictures[0], '(hash:', h1, ')'

    data = []
    xlabels = []
    for j in range(1, len(pictures)):
        image2 = Image.open(pictures[j])
        h2 = dhash(image2)
        print 'Hash of', pictures[j], 'is', h2
        xlabels.append(pictures[j])
        data.append(diff(h1, h2))

    # plot results
    fig, ax = pylab.plt.subplots(facecolor='white')
    pos = pylab.arange(len(data)) + .5
    ax.set_xlabel('difference in bits')
    ax.set_title('Bitwise difference of picture hashes')
    pylab.plt.barh(pos, data, align='center', color='#E44424')
    pylab.yticks(pos, xlabels)
    pylab.grid(True)
    pylab.plt.show()

if __name__ == '__main__':
    main()
