Welcome to the personal website & blog of Adrianus Kleemans.
Check out the most recent posts below:


Recursive SQL

10 days ago

Recently I had to write some SQL to connect some CIs (configuration items, for example servers) with their parent applications. The crucial point was that there could be arbitrarily many layers of applications above the server.

Here’s an example with 3 layers of applications (the → indicates “… is parent of …”):

All the relations are in the same table, so there’s not really an indicator on which hierarchical level an application is until you search for more relations to other connected applications.
Technically, the goal was to join the table with itself n times, where n may be different for each entry:

As this is not possible by simple joining, I had to find another way to connect the different applications with each other. Let’s start with a simple table with 11 relations:

rel_nr   parent_id   child_id
0        App A       App B
1        App B       App C
2        App C       Server A
3        App C       Server B
4        App C       Server C
5        App B       App D
6        App D       Server D
7        App D       Server E
8        App A       App E
9        App E       Server F
10       App E       Server G

As you can see App A is the root node with its children (App B, App E), which again have their own children.


To query the table recursively, one can use Oracle’s CONNECT BY, and connect the child_id with the corresponding parent_id:

SELECT
  parent_id,
  child_id,
  CONNECT_BY_ROOT parent_id root_id,          -- root application
  SYS_CONNECT_BY_PATH(child_id, ' -> ') path, -- path
  LEVEL depth                                 -- tree depth
FROM app_relations
CONNECT BY NOCYCLE PRIOR child_id = parent_id -- connect w/ PRIOR
ORDER BY
  depth, root_id, parent_id

CONNECT BY … PRIOR initiates the chaining: each row’s child_id is matched against the parent_id of other rows, which links applications that appear as a child in one row and as a parent in another.

This will result in the complete list of existing relations, complete with root_id, path and also a depth for each relation, indicating how many nodes have been found:

Quite a few paths found! As you can see, incomplete paths “in the middle” are shown as well, for example App B → App D, a path which misses the root node (App A) and also the leaf node (Server D or E), but is nonetheless a valid path.

Narrowing it down

Because we’re not interested in all the paths, we have to narrow it down to paths which start at our root Application, App A. Luckily, Oracle provides an addition to the CONNECT BY clause, namely START WITH. Here we can specify that we’re not interested in any path other than starting with App A, and efficiently rule out the whole rest.

SELECT ... -- same columns as before
FROM app_relations
START WITH parent_id = 'App A'
CONNECT BY NOCYCLE PRIOR child_id = parent_id

Now we have all relations from our root node, and we can as a last step just take those which end with a server:

SELECT
  root_id || path AS complete_path
FROM ( ... ) -- subquery from above
WHERE
  child_id LIKE 'Server%'   -- only paths which end with a server.

This gives us the list:

  • App A -> App E -> Server F
  • App A -> App E -> Server G
  • App A -> App B -> App C -> Server A
  • App A -> App B -> App C -> Server C
  • App A -> App B -> App C -> Server B
  • App A -> App B -> App D -> Server D
  • App A -> App B -> App D -> Server E
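
The same traversal can also be tried without an Oracle database. The following is a minimal sketch that swaps in a different technique: a standard recursive common table expression (WITH RECURSIVE), run through Python's built-in sqlite3 module, instead of CONNECT BY. The function name server_paths and the in-memory table setup are assumptions made for illustration; the rows are the 11 relations from the example table. Unlike NOCYCLE, this version has no cycle protection, so it assumes the relations really form a tree.

```python
import sqlite3

# The 11 (parent_id, child_id) relations from the example table.
RELATIONS = [
    ("App A", "App B"), ("App B", "App C"), ("App C", "Server A"),
    ("App C", "Server B"), ("App C", "Server C"), ("App B", "App D"),
    ("App D", "Server D"), ("App D", "Server E"), ("App A", "App E"),
    ("App E", "Server F"), ("App E", "Server G"),
]

QUERY = """
WITH RECURSIVE tree(child_id, path, depth) AS (
    -- anchor part: plays the role of START WITH parent_id = :root
    SELECT child_id, parent_id || ' -> ' || child_id, 1
    FROM app_relations
    WHERE parent_id = ?
    UNION ALL
    -- recursive part: plays the role of CONNECT BY PRIOR child_id = parent_id
    SELECT r.child_id, t.path || ' -> ' || r.child_id, t.depth + 1
    FROM app_relations r
    JOIN tree t ON r.parent_id = t.child_id
)
SELECT path FROM tree
WHERE child_id LIKE 'Server%'   -- only paths which end with a server
ORDER BY path
"""


def server_paths(root="App A"):
    """Return all paths from `root` down to a server (hypothetical helper)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE app_relations (parent_id TEXT, child_id TEXT)")
    con.executemany("INSERT INTO app_relations VALUES (?, ?)", RELATIONS)
    rows = con.execute(QUERY, (root,)).fetchall()
    con.close()
    return [row[0] for row in rows]


for path in server_paths():
    print(path)
```

Because the anchor part of the CTE already restricts the start to one root, no separate root_id column is needed here; the path string is simply built up from the root on every recursion step.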

Hierarchy tree

Using the jQuery plugin jsTree and a little helper library we can generate a clickable JavaScript tree:


For further reading and some nice examples of CONNECT BY have a look here. Thanks for reading!



How to sort by rating

107 days ago

Sometimes there’s one good way to do something and many poor ways (like choosing the right CSV delimiter). I really like Evan Miller’s post about how to sort ratings and will try to outline the idea and show an example with a star-based rating.

The problem

The problem: What is the right way to sort items (products, comments, …) with an average rating by n users?

By the right way, I mean rating items in the most helpful way for other users – by providing reliable information about the quality of an item (quality as perceived by the users, we don’t know anything real about the quality of the item).

Obviously both factors, the average rating and the number of users, matter. A single 5 star rating isn’t nearly as trustworthy as 20 ratings with an average of 4.8 stars, which intuitively makes sense. Also, an average rating of 4 should be ranked higher than an average of 3.5 with roughly the same number of users.

An example: sorting products

Let’s take the Swiss online shop Digitec as an example. Searching for “Nexus” yields some results, for example the Nexus 9 tablet and the Nexus 6 phone:

The rating for both is rounded to 4.5 stars, but on the product page you can see the individual ratings, and they average as follows:

  • Nexus 9 Tablet: 4.57 / 5 stars, rated by 7 users
  • Nexus 6 Phone: 4.48 / 5 stars, rated by 54 users

We have an average rating from n users, but how sure can we be that it matches the real rating? What we’re really interested in is the hidden, real rating of the item!

For example, if we take all the customers of a product and compare it to those who actually submit a review, it makes sense that the bigger the number of users who gave a review, the more accurate (= nearer to the real rating) our average rating is:

If we were unlucky and just picked some users with extreme ratings, our average will be too high or too low.

To be totally sure we get the real rating, we’d have to ask all the people who bought it, but that information is not available. But what if we say we want to be 95% sure that the real rating isn’t lower than a certain bound?

We should also think about how that group is chosen from the total of all customers. Because the group is likely to be biased (maybe users with a negative experience are more likely to share it, etc.), we can’t really make any assumptions about the underlying distribution. We therefore interpret our values as parameters for a confidence interval for an unknown p of a binomial distribution.

The solution

The solution: Given the average rating and our n users, we calculate a lower bound such that we can be at least 95% (or 90% or whatever) sure that the real rating isn’t below this bound!

To be exact, we need the lower bound of the binomial proportion confidence interval at confidence level 1 − α (with our parameters average rating and n). Besides our two factors a third comes into play, α, which indicates how sure we want to be. A common choice is α = 0.05, i.e. a confidence level of 95%, so that in 95% of the cases (on average) we’re right.

We can approximate this lower bound with the Clopper-Pearson interval or another method (like Wilson’s interval). This gets us the following (for code see below):

  • Tablet: 4.57 / 5 with 7 ratings: real rating is at least 2.84 stars (with 95% certainty)
  • Phone: 4.48 / 5 with 54 ratings: real rating is at least 4.00 stars (with 95% certainty)


The Nexus 9 Tablet only gets a lower bound of 2.84 stars (which could be expected with only 7 ratings), whereas the Nexus 6 Phone scores 4 stars! So the Phone should clearly be ranked higher than the Tablet:

(link to my digitec search)

Code and appendix

We can express the Clopper-Pearson interval in terms of the beta distribution, as stated in Wikipedia, which gives us for the lower bound:

$$B\left(\frac{\alpha}{2};\; x,\; n - x + 1\right)$$

where B(q; a, b) is the q-th quantile of the beta distribution with shape parameters a and b, x is the number of successes in a binary ± rating system and n is the number of ratings. Translated to our star-based rating system (see also here) this turns into:

stats.beta.ppf(alpha / 2, rating*n, n - rating*n + 1)

Also, because 1 star is the lowest rating, we have to normalize the 1 to 5 star ratings to a [0, 1] interval with

rating = (rating-1)/4

Python code for this example (2.7, requires scipy for the beta function):

from __future__ import division
import scipy.stats as stats

def lower_bounds(rating, n, alpha=0.05):
    ''' Assuming a 1 to 5 star-rating '''
    rating = (rating-1)/4
    ci_low = stats.beta.ppf(alpha / 2, rating*n, n - rating*n + 1)
    return ci_low*4 + 1

items = [['Nexus Tablet', 32/7, 7], ['Nexus Phone', 242/54, 54]]
for i in range(len(items)):
    items[i] = items[i] + [lower_bounds(items[i][1], items[i][2])]

for item in sorted(items, key=lambda x: x[3], reverse=True):
    print item

Returns the following (name, avg. rating, # of ratings, lower bound):

['Nexus Phone', 4.481481481481482, 54, 4.0039512303867282]
['Nexus Tablet', 4.571428571428571, 7, 2.8357283010686061]
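
The Wilson interval mentioned earlier as an alternative can be sketched with the standard library alone (Python 3.8+, no scipy). This is a minimal sketch, not the code used for the numbers in this post: the function name wilson_lower_bound is made up, and it relies on the same assumption as the code above, namely that the normalized average rating can be treated as a proportion p out of n trials.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(rating, n, alpha=0.05):
    """Lower bound of the Wilson score interval for a 1 to 5 star rating.

    rating: average star rating (1..5), n: number of ratings.
    Hypothetical helper; mirrors the signature of lower_bounds() above.
    """
    p = (rating - 1) / 4                       # normalize stars to [0, 1]
    z = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided normal quantile
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    lower = (centre - margin) / (1 + z * z / n)
    return lower * 4 + 1                       # back to the 1..5 star scale


items = [("Nexus Tablet", 32 / 7, 7), ("Nexus Phone", 242 / 54, 54)]
ranked = sorted(items, key=lambda item: wilson_lower_bound(item[1], item[2]),
                reverse=True)
for name, rating, n in ranked:
    print(name, round(rating, 2), n, round(wilson_lower_bound(rating, n), 2))
```

The exact bounds differ somewhat from the Clopper-Pearson values, but the ranking comes out the same: the phone with its 54 ratings ends up above the tablet with its 7.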

Thanks for reading!



10 useful learning habits

183 days ago

Last fall I took the “Introduction to statistics” course at the University of Bern (see for example my post about Boxplots). It was my last course after 7 years of studying, but it turned out to be the most insightful one I had taken so far.
Working part-time, I had lots of time to spend on a single course and also to reflect a bit on learning as a habit. I gained some deeper knowledge about my own learning patterns and the difficulties I had, and I’d like to share some of those thoughts.

1. Write your own summaries

I remember using a summary from a friend (“Neat, I don’t have to do it myself!”) for an oral exam in a history subject. I tried to get a grip on the things he wrote, in the way he wrote it – but it was just not possible. The information was already quite dense and totally okay for a quick overview, but when I sat down with my lecturer, I didn’t remember anything from it. Crafting your own summary can be fun too, and helps you to prepare the learning material in a way that lets you get the most out of it (see Get your gear ready).

Write down your Eureka!-moments. When you learn, sometimes it just feels like a knot unravels and you really get that warm, exciting feeling of a deeper understanding of a certain topic. Most of the time these moments are about small things – a procedure which finally makes sense, or you get what has been used to solve a certain exercise.
I noticed that writing those down has two primary benefits: You remember better what exactly you understood, and it’s also easy to understand that particular thing again really quickly, which is crucial in exam preparation, when looking through the important stuff again.

Sometimes it’s even a good idea to just focus on these little moments exclusively, to just be aware where your very own difficulties are when solving exercises, and take corresponding measures. For example, I noticed that I sometimes had trouble with integrals, and although it wasn’t even part of the lecture but a prerequisite, it came up quite a few times before I decided to do a little refresh on the topic.

Keep digging. When summarizing an exercise or a concept (or ideally explaining it to someone) you typically understand a whole lot but also come across some unknowns. While it is important to focus on important things, digging deeper works much like asking questions: pondering on a topic helps you get an even deeper understanding.
When producing a summary, those things you don’t understand emerge much clearer than when you just consume and learn without actively recreating anything, so it is a good practice to at least write them down, and maybe look at them another day or discuss them with other people.

2. Ask questions

One of the most important things I learnt was to ask questions. I used to be (and sometimes still am) a bit of a loner when it comes to learning: I’m just much more focused when I’m alone, and the planning is much easier too. If I’m tired, I skip a learning session; if I feel like it and I’m making good progress, I can learn late into the night.

Use the available human resources such as course assistants, other students, ask questions, talk to people, send emails. Understanding isn’t just about reading text books or solving predefined exercises, but also about an active communication with other human beings. Sometimes I could ponder literally for hours on a single problem, and taking it with me into class and discussing it with other people solved it within minutes. Communication is key.

Write down your questions. Writing down clarifies the exact problem of understanding and sometimes even solves the problem, or at least breaks it down to what exactly you don’t understand. Think it over another time, ask other people for help – sometimes what’s difficult for you is really easy to explain for others.

Follow the exercise schedule. If a question hour is provided, be sure to use it – it’s like free private coaching! Make use of that, write down your problems, ask questions. Thinking “Oh, I didn’t manage to finish the exercises this week, I’ll do them next week” is a vicious cycle and really hard to break, even if you have plenty of time available.
This is why: If you try to solve a problem, fail, think about it for some time, boil it down to what exactly you don’t understand and then solve it with your tutor – this is experience and understanding on the subject you’ll never be able to get just from looking up the solutions some weeks before the exam.

3. Be aware of your different focus levels and use them

Different parts of learning require a different level of concentration – for example, memorizing words or copying notes from a friend don’t take as much concentration as getting a new concept or trying to solve a difficult question. Often learning combines at least some different activities, which you can easily split up into high- or low-concentration tasks.

Doing so is crucial, especially if you’re working and can’t (or don’t want to) spend all of your time learning, you have to cleverly manage your time.

Create a high-concentration environment without any distractions and visit it regularly.
I have some space at home, but also used to stay at the library in the university or in my company after work, in a quiet, distraction-free room to really be able to do some serious studying. This is where the real learning happens and where I gain most of my insights.
Use it only if you feel like it and if you’re ready; it doesn’t make much sense to try when you’re tired or exhausted after a long day of work. Sometimes it’s also hard to really focus and not procrastinate by doing things you could do later, like organizing stuff or browsing through forums for a specific problem.

Save other activities like organizing your material or learning stuff by heart for low concentration environments, such as a train ride, a break at work, watching TV, etc. Doing so can save you a lot of stress, as you know exactly which tasks to do when. It takes away the feeling which puts you under pressure when you are doing other things, like “I should be doing some studying right now” – no! You’re aware that all you can do now, casually, is some repetition or light reading and that you’re ready to deal with complex problems in the sessions where you are highly focused, well rested, and undistracted.

4. Get your gear ready

In humanities, I used to learn out of the textbook and take some notes, learn from these and maybe read through all of it again before the exam. Then I started studying computer science and suddenly became aware that this just wasn’t enough anymore. I had to find efficient ways not only to learn new stuff, but also to be able to remember it for a long time.

Find suitable forms of representation. Once you get the idea of how much you’ll be learning and what stuff to remember, you’ll have to process and represent it in a way which lets you get a quick hold on it, so once you’ll look at it you’ll go “oh wait, I know this one and how it works”.
Obviously, this will not be possible with everything, so you have to write down more complex things too. I generally distinguished between:

  • simple facts or formulas, which I would write down on these little memory cards
  • algorithms or complex procedures, which I would first try to understand, then summarize them and put them down with some good examples in a dedicated notebook. Adding some colors, graphics and extra facts about certain problems really turned it into a neat summary of nearly the whole lecture and I enjoyed working with it very much.
    I also recognized that I’m the graphical learning type and benefit from using colors and fancy drawing stuff – I think this is quite individual, others try to remember things with little stories or even by singing them :-) As long as you find out what gets you going.

Gather every bit of relevant information you can get. I used to have one big Dropbox folder with literally everything about the lecture in it – descriptions, links to websites (Wikipedia articles, specialized pages like Matroids math site), exercises, solutions, all the stuff from last year’s lecture, script, my own notes from the lecture, other notes, etc. etc.
I really liked the freedom to just pull out my tablet anywhere – for example, I was in the Netherlands (twice) in the time before the exam, visiting family, and did a good part of the learning there – just because I had everything ready. I even scanned in my whole 70 pages of notes on the lecture the day before I left to have everything available.

5. Get used to regular learning

One thing that really helped me, as a quite spontaneous person, were the fixed learning sessions I planned on a regular schedule, spread across the week. I had mine on Monday morning, Wednesday afternoon and Saturday morning, also because work let me. They changed over time (see Adapt your learning behavior over time), becoming more frequent as the exam came closer. I made a deal with myself that, no matter what else was going on, I would always use these sessions to do some studying, and after some trouble in the beginning I managed to integrate them into my routine.

These learning sessions were also the place where I tried to get “in the zone”, to create a high-focus environment as already described, so I could rely on those sessions to work through the stuff that popped up during my work week, which I used to write down on a list especially for that purpose. It also took me a while to realize that I wouldn’t be able to use my time efficiently when learning at home: too many distractions. It’s really hard to learn while you could do some housekeeping or cleaning, too :-). Going to a quiet room at the university or the library helped a lot.

6. Prioritize

Identify important topics. Depending on the style of lecture, you’ll have some kind of exercises or papers to turn in, which gives you a rough estimate of what to expect later on. As we had weekly task assignments to hand in, there were lots of topics covered and the idea on important topics was quite vague.

Throughout my studies, I noticed some indicators for important topics:

  • An exercise (or part of it) keeps coming up through multiple task assignments.
  • A certain concept is the foundation for other exercises, or there’s a common thread through a part of the exercises. Those kinds of exercises also make nice exam questions, where the different parts of a question are intertwined.

For example, I once drew a dependency model (of topics) in linear algebra:

As you can clearly see, the orange marked balloons are really the foundation for pretty much everything else, so it would be a clever thing to invest some extra time into them. (Also, there were lots of exercises which combined some of the marked bubbles.)

  • Carefully listening to what the prof said gave me some good hints. Usually the exam won’t consist of topics the prof finds unimportant or boring.
  • Asking the course assistants also helped me a lot, usually because of their experience they have a good sense for more important topics and will tell you so.

Divide and conquer. When faced with a big task, for example learning to integrate, it’s mostly too big to cope with in a single session. Also, beginning is really hard if you see the dimensions of the things you want to tackle and you stand before a mountain of work. Dividing the task is of utmost importance, before you despair. Most of the time it is a good idea to begin by looking at practical examples like a specific task: How can you integrate 2x^2? Building on some basic tasks and expanding the scope by conquering certain parts (“Oh, powers of x are mostly treated the same way. Let’s learn some other common rules of integration.”), you’ll be able to gradually build up knowledge.

Don’t get lost. This one’s sometimes a hard one to spot – finding the compromise between breadth and depth on a certain topic. It’s tempting to just stick to something you’re understanding instead of moving on and facing the stuff that gives you a headache.
I know some people who I knew had studied a lot and were really confident before the exam, and who would then say afterwards, “Wow, what was that – I just totally blew it because I learnt the wrong things!”. As trivial as it sounds, learning the right things is important, and success can be misleading and a poor guide in this regard.

Ask yourself how much time you’ve got left – realistically – and adjust your learning depth accordingly. You don’t want to end with some important topics uncovered! Identifying the most important topics as mentioned before is extremely helpful not only for learning the right things, but also for planning and time allotment.

7. Track your progress

Keep a diary of what you do and learn – just a few key-words or topics you’ve been working on, one line per learning session. It pushes your self-confidence to see what you already accomplished, and it’s also a neat summary of the topics you already covered or where you should invest some more time and effort.

Get a grasp of the important things already described in section Prioritize, and make a list of exercises, or of topics (or both!), to see how you’re doing on a large scale. Draw a mind map and see what needs your attention and what you have already covered sufficiently.

8. Apply your knowledge in real life

Another thing that really helped me was the application of what we learnt (quite easy in statistics, but also useful on other subjects) – not just in the weekly assignments, but testing it on some real life data. And to maintain your interest to invest some additional time – since it is not directly part of the syllabus, strictly speaking, and an additional workload – it is important to search for some interesting examples.

An example: We had some data provided by students from the years before us, a questionnaire with some questions about origin, habitation, income and so on. One category was “smoking”, and they could answer with “yes”, “no” or “occasionally”. I looked at the categories men/women and compared their smoking behavior, which was nearly identical, relatively speaking, in the “yes” and “no” categories. I then used Fisher’s exact test to show that if you pick an occasional smoker at random (from a group like the one the data came from, and with as many men as women), the odds would be nearly 1.7x higher for that person to be a woman.

Without even considering whether the group size was representative (it wasn’t) or whether the group was skewed (it was), the data still yielded a new and unexpected result, which helped me understand the meaning of statistical odds while also showing some results on real data.

Connecting your knowledge. The even bigger benefit of practicing methods and techniques is the newly gained experience when to use which concept from the vast amount of information and skills you’re learning. Asking yourself the questions, “Is this concept appropriate in this case? Will the result of this test have any meaning?” and even remembering what procedures are available at all helps you to connect and classify the new knowledge.

Another big advantage is that if you choose good examples, they are more likely to stay in your memory. Even now, months after my exam, I still recall most of my own data experiments in detail, as opposed to the given tasks we had to solve.

9. Be persistent

As time goes by and you’re in a continuous flow of learning, sometimes you just get thrown out of the routine, either by a lack of motivation, getting sick, going on holidays or whatever. I thought for a long time that this (immutable) process was just plain failure – that I was not able to be a really continuous learner. Over time I realized that only my reaction to these interruptions could be changed; everything else is plain luck and beyond my influence.

Get going again – maybe the simplest rule, but also the hardest – it’s normal to lose one’s rhythm, but the difficult part is to get going again after a break and getting back on track. One good practice is to think long-term: I want to succeed, therefore I have to start learning again, if not today then tomorrow. It’s not really a decision if, but when to start learning again, and every day I procrastinate is a day closer to the exam, unused.

Actively care about your motivation. One thing that really helped me stay motivated was the application of stuff we learned, so I tried to find some interesting problems to solve (see Apply your knowledge). It’s really motivating to have some real-life stuff to work with and the feeling of already making use of the newly acquired knowledge.

It also helped me to set some intermediate goals, which would typically split a learning session into three or four parts, and which also were a good indicator to take a break and reward myself with a cup of coffee, a walk around the block etc. Enjoying a free afternoon is so much more rewarding after you had a good learning session or some items ticked off the list.

10. Adapt your learning behavior over time

There are usually several months of studying involved for a single lecture, and it doesn’t make much sense to just stubbornly go through everything in chronological order (see Prioritize), or even to stick to a certain way of learning from the beginning until the end, because we don’t work that way. Our brain learns by repetition and actively using the new skills (see Apply your knowledge).

Because you make quite a lot of progress, from understanding nothing and having new concepts thrown at you, to gradually getting a grip, to (hopefully) a thorough understanding, your way of learning should change as well. Besides your own progress, the circumstances change too: for example, the lecture finishes and you get an additional month or two until the exam is due, with more time on your hands to spend.

In respect to those two factors, I used to split my learning into 3 stages:

  • Lecture stage
  • Understanding stage
  • Reviewing stage

Lecture stage. The beginning of the course with lectures to attend can be quite stressful, so the main motto here was for me: do the weekly exercises, write down everything you hear, survive :-)

Understanding stage. Like preparing for a marathon, this is the main stage after warming up to lay a good foundation for everything that follows. It is the last time it makes sense to catch up on things that were left behind earlier. Usually this is the most intense stage in the whole learning process. This is also the time to work out a neat summary (see Write your own summaries). Here you should try to acquire and deepen your knowledge, apply it, and do some additional work, if time allows.

Reviewing stage. In the weeks before the exam, you should recognize a shift of what you do and learn, the focus lies more on reviewing already learnt material rather than looking at completely new problems. Here you should have seen everything at least once, and coming across something you don’t understand yet you have to decide whether to start working on it or to leave it aside (see Prioritize).
Focus on learning by heart, connecting the knowledge, and understanding the higher purpose of the concepts, why and when to use them.

Prepare for the modus operandi of the exam. Last of all, you should prepare yourself specifically for the exam – what circumstances apply, what do you have to know by heart? How many tasks will there be, how much time do you have for each of them? Also, most of the time writing a summary is just for the sake of learning, not for bringing it with you to the exam. So plan enough time not only to write your summaries, but also to look at them and memorize them – eventually it has to be in your head, not on paper.

Learning with memory cards really did the job for me, working with a simple system where cards get laid aside if I know them, and from time to time I would repeat the whole stack. Then, in the days before the exam, things I didn’t know or understand by then I learnt by heart, to be at least able to write something down, should it come up in the exam.

Never try to understand something on the day before the exam. One of the most important rules regarding the exam – if you don’t get it two days before the exam it’s just too late. If you try to get something new, it will only distract you and make you feel unsure, focusing on things you can’t do instead of relying on everything you’ve learnt so far.

Good luck ;-)




287 days ago

I took a statistics course last fall, and there is a lot of code involved for the whole calculation of different parameters.
“Unfortunately” all of this is in R – which is undisputedly one of the best tools for statistics in general – but I don’t know it well enough for some good results in a few minutes. So I started using Python with matplotlib.

One example is the boxplot, great for an overview of 5 important parameters: the median, the min and max (if in range), and the two quartiles that bound the 50% box (IQR).
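
To make these parameters concrete, here is a minimal sketch that computes them with the standard library only (Python 3.8+). The function name boxplot_stats and the sample values are made up for illustration. It assumes the common convention that whiskers reach to the most extreme data point within 1.5 × IQR of the box and that quartiles are linearly interpolated; this is meant to mirror matplotlib's defaults, but treat that as an assumption rather than a guarantee.

```python
import statistics


def boxplot_stats(values, whis=1.5):
    """Return the numbers a boxplot draws (hypothetical helper)."""
    data = sorted(values)
    # Quartiles with linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                                   # height of the 50% box
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],                   # min, if in range
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],                 # max, if in range
        "fliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up sample values, not the student data from this post.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_stats(sample))
```

In the made-up sample the value 100 lies outside the upper fence, so it would be drawn as a separate outlier point ("flier") and the upper whisker would stop at 9.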


The data is from our professor, who provided the weight and height of students from some years ago, 250 students in total.

After entering the data and calculating the BMI, a first boxplot, which can be generated automatically with boxplot(data), looks as follows:

So, next up is some fine-tuning to make it look better.


First, I wanted to take out some black lines from the axes, to put more focus on the boxplots themselves. I took some good suggestions from here.

I started by removing these unnecessary “spines”:


Then the ticks:
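
A minimal sketch of these two steps, using current matplotlib with Python 3. Which spines to hide is an assumption here (top and right), as are the helper name strip_frame and the sample values; the calls themselves (spines[...].set_visible and tick_params) are regular matplotlib API.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no window is needed
import matplotlib.pyplot as plt


def strip_frame(ax, sides=("top", "right")):
    """Hide the given spines and the tick marks (hypothetical helper)."""
    for side in sides:
        ax.spines[side].set_visible(False)
    # length=0 removes the little tick marks but keeps the tick labels
    ax.tick_params(axis="both", which="both", length=0)
    return ax


fig, ax = plt.subplots(figsize=(9, 6), facecolor="white")
# Made-up values, only there to have something to draw.
ax.boxplot([[18.0, 20.5, 22.1, 24.3, 30.2], [17.5, 19.8, 21.0, 23.4, 26.9]])
strip_frame(ax)
fig.savefig("boxplot_sketch.png", bbox_inches="tight")
```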


Then, for better reading, I added some horizontal lines, part of the background grid:

ax.yaxis.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)


For better readability at first sight, some more explanation in the title and on the axes:

ax.set_title('BMI-Vergleich von Studierenden')
pylab.xticks([1, 2], ['m', 'w'])


But the color also wasn’t what I had in mind; the blue is quite aggressive. So I applied an indigo tone to all the elements except for the median. This can be done by setting the parameters for each class of elements separately:

blue = '#0D4F8B' #indigo
pylab.plt.setp(bp['boxes'], color=blue)
pylab.plt.setp(bp['medians'], color='red')
pylab.plt.setp(bp['whiskers'], color=blue)
pylab.plt.setp(bp['fliers'], color=blue)
pylab.plt.setp(bp['caps'], color=blue)

Also, so that the figure which is shown on screen (besides the one that is saved) isn’t presented in some grey box, you can add facecolor="white" when creating the figure.

So here’s the final Boxplot (click for bigger picture):

Python Code (Python 2.7, matplotlib required)

'''Plots some boxplots about student BMI data.'''
__author__ = 'Adrianus Kleemans'
__date__ = '09.12.2014'

import pylab

# BMI data StatWiSo2003 (m, f)
data = [[17.9163, ... ]]

# create a figure instance
fig = pylab.plt.figure(1, figsize=(9, 6), facecolor="white")
ax = fig.add_subplot(111)
bp = ax.boxplot(data)

# remove axes and ticks

# some helping lines
ax.yaxis.grid(True, linestyle='-', which='major', 
    color='lightgrey', alpha=0.5)

# Hide these grid behind plot objects
ax.set_axisbelow(True)

# title and axis labels
ax.set_title('BMI-Vergleich von Studierenden')
pylab.xticks([1, 2], ['m', 'w'])

# color boxplots
blue = '#0D4F8B' #indigo
pylab.plt.setp(bp['boxes'], color=blue)
pylab.plt.setp(bp['medians'], color='red')
pylab.plt.setp(bp['whiskers'], color=blue)
pylab.plt.setp(bp['fliers'], color=blue)
pylab.plt.setp(bp['caps'], color=blue)

fig.savefig('boxplot.png', bbox_inches='tight')

That’s it!



Searching bookmarks

310 days ago

Ever had the problem that you remember that site, that blog post but you have no idea when or where that was? If you share the same habit of just bookmarking potentially interesting pages, you surely came across the same problem I had last week: Knowing that somewhere in your bookmarks lies that website with the exact thing you’re searching for, but being unable to access it.

And maybe you vaguely remember that back then, it took you hours to google that shit, and you’re not willing to do that once again. That feels extremely frustrating. Especially if you’re a chaotic person like me with zero order in your bookmarks (who’s got time for that, seriously? :)).

So I wrote a Python script which downloads all my bookmarked pages and makes them full-text searchable. It works as follows:

  • Step 1: Export your bookmarks as an HTML file and save it as bookmarks.html (the script currently works with Chrome bookmark files. Check here on how to export your bookmarks)
  • Step 2: Run the script and let it download all your bookmarked pages (this could take a while)
  • Step 3: Once downloaded, the information will be pickled (saved) into a file and on the next start, you don’t have to download all the pages again. Neat!
  • Step 4: Enter search terms and all the pages containing them will be returned.
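Step 4 boils down to a simple AND match: a page is a hit only if every search term occurs in its text or in its URL. A minimal sketch of that idea, written in Python 3 syntax; the function name `search` and the tiny sample library are made up for illustration:

```python
def search(library, query):
    """Return the URLs whose page text or URL contains every search term."""
    keywords = query.split()
    hits = []
    for url, text in library.items():
        if all(keyword in text or keyword in url for keyword in keywords):
            hits.append(url)
    return hits


# hypothetical sample data: URL -> page text
library = {
    "http://example.com/threads": "An introduction to python threads",
    "http://example.com/cooking": "How to bake bread",
}

print(search(library, "python threads"))
```

Like the script, this matches case-sensitive substrings, so "Python" and "python" are treated as different terms.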

In my case, downloading 1300+ pages does indeed take a while.

After finishing downloading, it generates a ~100 MB pickle:

This pickle will then be used when the program is restarted.
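The caching idea is essentially "load the pickle if it exists, otherwise build the library once and dump it". A minimal sketch in Python 3 syntax, where the helper `load_or_build` and the `fake_download` callback are hypothetical stand-ins for the slow download step:

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Load the library from `path`, or build it once and cache it there."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    library = build()
    with open(path, "wb") as f:
        pickle.dump(library, f)
    return library


calls = []


def fake_download():
    # stand-in for the real download step; records how often it runs
    calls.append(1)
    return {"http://example.com": "some page text"}


path = os.path.join(tempfile.mkdtemp(), "bookmarks.p")
first = load_or_build(path, fake_download)   # builds and writes the pickle
second = load_or_build(path, fake_download)  # reads the pickle, no rebuild
print(first == second, len(calls))
```

The second call never touches the download step, which is why a restart is fast once the pickle exists.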

Searching some terms:

Found existing library, reading.

Enter search terms (quit with q): python threads

Enter search terms (quit with q): python processes
http://wiki.python.org/moin/PersistenceTools
http://python3porting.com/improving.html
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
http://news.ycombinator.com/item?id=3120380
http://www.doughellmann.com/PyMOTW/multiprocessing/basics.html
http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies
http://morepypy.blogspot.de/2012/08/multicore-programming-in-pypy-and.html
http://docs.python.org/2/library/multiprocessing.html

Enter search terms (quit with q): multiprocessing
http://python3porting.com/improving.html
http://www.doughellmann.com/PyMOTW/multiprocessing/basics.html
http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies
http://morepypy.blogspot.de/2012/08/multicore-programming-in-pypy-and.html
http://docs.python.org/2/library/multiprocessing.html


Python code (Python 2.7, you’ll need html2text):

'''Downloads your bookmarked pages and makes them full-text searchable.'''
__author__ = "Adrianus Kleemans"
__date__ = "December 2014"
import os
import os.path
import pickle
from urllib2 import *
import html2text
import HTMLParser

def download(url):
    f = urlopen(url, timeout=5)
    data = f.read()
    f.close()
    charset = f.headers.getparam('charset')

    if charset is not None:
        try:
            udata = data.decode(charset)
            data = udata.encode('ascii', 'ignore')
        except (UnicodeDecodeError, LookupError):
            print 'Error decoding with charset=', charset

    return data

def download_library(library_pickle):
    library = {}
    f = open('bookmarks.html', 'r')
    bookmark_file = f.read()
    f.close()

    bookmarks = bookmark_file.split('A HREF="')
    del bookmarks[0]

    for i in range(len(bookmarks)):
        bookmarks[i] = bookmarks[i].split('"')[0]
    print 'Found', len(bookmarks), 'bookmarks.'

    # download bookmarked pages
    for bookmark in bookmarks:
        if bookmark not in library:
            print 'Downloading', bookmark[:50], '...'
            txt = ''
            try:
                txt = download(bookmark)
                txt = html2text.html2text(txt)
            except Exception, e:
                print 'Error:', e
            library[bookmark] = txt

    pickle.dump(library, open(library_pickle, 'wb'))

def main():
    library_pickle = 'bookmarks.p'
    if os.path.exists(library_pickle):
        print 'Found existing library, reading.'
    else:
        print 'No library found. Downloading, this could take a while...'
        download_library(library_pickle)
    library = pickle.load(open(library_pickle, 'rb'))

    # main search loop
    while True:
        keywords = raw_input('\nEnter search terms (quit with q): ')
        keywords = keywords.split(' ')
        if keywords == ['q']:
            break
        for bookmark in list(library):
            is_candidate = True
            for keyword in keywords:
                if keyword not in library[bookmark] and keyword not in bookmark:
                    is_candidate = False
            if is_candidate:
                print bookmark
    print 'Finished.'

if __name__ == "__main__":
    main()

