Making Game Boy manuals searchable via OCR

November 6, 2023

This post is part of the Game Boy series (#6).

TLDR: I created a searchable index of scanned PDF Game Boy manuals:

There has been an ongoing effort to preserve and collect Game Boy manuals, and many (although not all) are available today, mostly as PDFs.

What I was missing, though, was a full-text search: searching for a term across all manuals and then finding the manuals and passages where it occurs.

Using OCR on a collection of manual PDFs

Of the multiple sources, like Kirkland's, dumps or, I found the most complete to be sprintinglegs' GameBoy manual list, so that's the one I used.

Once I had all the manuals as PDFs, I decided to use Python to extract the text content. In a first step, each page is converted to an image using pdf2image; each image is then run through OCR using pytesseract.

The content is then saved as JSON (for easy processing in a web app). Here is the code:

import glob
import json
import pdf2image
import pytesseract

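def get_text(pdf_file):
    """Return the OCR text of all pages of a PDF as one string."""
    text = ''
    # Render every PDF page to an image (pdf2image needs poppler installed).
    images = pdf2image.convert_from_path(pdf_file)
    # Start counting at 1 so the progress output matches the page numbers.
    for pg, img in enumerate(images, start=1):
        print(f'Working on page {pg}')
        # Run Tesseract on the page image and append the recognized text.
        text += pytesseract.image_to_string(img)
    return text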

# Maps each PDF path to the full OCR text of that manual.
ocr_text = {}
files = glob.glob("manuals/*.pdf")
print(f'Found {len(files)} files.')
for file_name in files:
    print(f'Scanning {file_name}...')
    text = get_text(file_name)
    ocr_text[file_name] = text

# Write everything into a single JSON file for the web app to load.
with open('text-output.json', 'w') as write_file:
    json.dump(ocr_text, write_file, indent=4)

Creating a simple Angular App

With all the text content ready, I created a small Angular app to search through the text and show a list of occurrences, grouped by manual. The search term is "highlighted" with brackets ([ ]), and some context around each occurrence (50 characters before and after) is also shown.
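The actual app is written in Angular, but the highlighting logic can be sketched in Python, the same language as the OCR script. This is only an illustration and not the app's real code: the function name find_occurrences and its parameters are made up here, and matching case-insensitively is an assumption.

```python
import re


def find_occurrences(text, term, context=50):
    """Return one snippet per match of `term`, with the match in [brackets].

    Hypothetical helper: case-insensitive matching is an assumption,
    the 50-character context window comes from the description above.
    """
    if not term:
        return []
    snippets = []
    # re.escape makes the search term a literal, not a regex pattern.
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        # Take up to `context` characters on each side of the match.
        before = text[max(0, match.start() - context):match.start()]
        after = text[match.end():match.end() + context]
        snippets.append(f'{before}[{match.group(0)}]{after}')
    return snippets
```

Calling find_occurrences(ocr_text[file_name], 'password') on one entry of the JSON output would give the list of bracketed passages for that manual.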

The list is sorted by the number of occurrences, so manuals with more occurrences appear higher.
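The ranking can be sketched in Python as well, working directly on the dictionary written to text-output.json. Again this is a rough sketch under assumptions rather than the app's code: rank_manuals is a hypothetical name, matching is assumed to be case-insensitive, and manuals without any match are assumed to be left out of the list.

```python
import re


def rank_manuals(ocr_text, term):
    """Return (manual, count) pairs, manuals with the most matches first.

    Hypothetical helper: `ocr_text` maps a file name to its OCR text,
    as in the JSON file produced by the script above.
    """
    if not term:
        return []
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    # Count the non-overlapping matches in every manual.
    counts = {name: len(pattern.findall(text)) for name, text in ocr_text.items()}
    # Keep only manuals that contain the term at least once.
    hits = [(name, count) for name, count in counts.items() if count > 0]
    # sorted() is stable, so manuals with equal counts keep their input order.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```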

You can try it here yourself!

Source code:

Thanks for reading and happy searching!