Migrating from Textpattern to Hugo

October 19, 2020

Over the last few weeks, I've been working to switch my blog from Textpattern to Hugo and to migrate the content (mainly my posts) to Markdown.

Unfortunately, Textpattern does not provide an export in a convenient format: there are ways to export a DB dump as a backup, and there are plugins for exporting data, but nothing out of the box.

So I decided to dump my posts from the database myself, and I'd like to show how.

Find connection params

If, like me, you're struggling to find the credentials for your Textpattern database, the best place to look is the configuration file, config.php:

$ cat textpattern/config.php
<?php
    $txpcfg['db'] = 'textpattern';
    $txpcfg['user'] = 'txpuser';
    $txpcfg['pass'] = '###';
    $txpcfg['host'] = 'localhost:3306';
    $txpcfg['table_prefix'] = '';
    $txpcfg['txpath'] = '/var/www/kleemans.ch/textpattern';
    $txpcfg['dbcharset'] = 'utf8';
    $txpcfg['siteurl'] = "https://kleemans.ch";
    define('PROTOCOL', 'https://');
?>
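If you'd rather not eyeball the file, the parameters can also be pulled out with a short script. A sketch, where the regex and the inline sample are purely illustrative (in practice you would read the real textpattern/config.php):

```python
import re

# Illustrative sample of a config.php; in practice, read the real file:
# with open('textpattern/config.php') as f: config = f.read()
config = """<?php
    $txpcfg['db'] = 'textpattern';
    $txpcfg['user'] = 'txpuser';
    $txpcfg['pass'] = '###';
    $txpcfg['host'] = 'localhost:3306';
?>"""

# Collect every $txpcfg['key'] = 'value'; assignment into a dict
params = dict(re.findall(r"\$txpcfg\['(\w+)'\]\s*=\s*'([^']*)'", config))
print(params['user'], '@', params['host'])
```

Note that this simple pattern only catches single-quoted values, so a double-quoted entry like siteurl would need a second pass.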

Connect to MySQL database

After getting the connection parameters, we can connect to the database:

$ mysql --host=localhost --user=txpuser --password=###

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 500
Server version: 5.7.27-0ubuntu0.16.04.1 (Ubuntu)
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
...

After this we tell MySQL which database we want to use...

mysql> use textpattern;

... and we can have a look at the tables:

mysql> show tables;
+-----------------------+
| Tables_in_textpattern |
+-----------------------+
| rah_sitemap           |
| rah_sitemap_prefs     |
| textpattern           |
| txp_category          |
| txp_css               |
| txp_discuss           |
| txp_discuss_nonce     |
| txp_file              |
| txp_form              |
| txp_image             |
| txp_lang              |
| txp_link              |
| txp_log               |
| txp_page              |
| txp_plugin            |
| txp_prefs             |
| txp_section           |
| txp_token             |
| txp_users             |
+-----------------------+
19 rows in set (0.00 sec)

This gives an overview of the whole Textpattern system. Parts of it are explained in the Textpattern documentation: Textpattern Schema reference.

I only found the schema for the newest version (4.9), but it matched my version (4.6.2) as well.

Exporting content

Next up is actually exporting/dumping the blog content. I tried several approaches, but ended up with the simplest one.

I tried to make SELECT * INTO OUTFILE work, but this needs elevated (admin) rights, and even after getting admin access I still wasn't able to export, because the server was running in a mode that disallowed such file exports.

So I stuck with the simplest export (actually the first one I tried):

mysql --user=txpuser --password textpattern -B -e "select * from textpattern;" > dump.tsv

The downside was that the result was not a "proper" TSV/CSV: a single post spans multiple lines without the usual quoting character, as in the following example. A line starts with the post ID, date, author, and so on, separated by proper tabs, but the body itself continues on new lines.

851 2020-09-21 06:33:46 NULL    adrianus    2020-09-21 06:33:46 adrianus    Automate downloading songs from youtube     Streaming [...]
\n
\n![](/images/420.png)
\n
\n## Getting song charts
...

Other approaches, like this regex-based example, would allow for a "," delimiter, but a post would still be spread across multiple lines.
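The saving grace is that mysql's batch mode escapes embedded newlines as a literal backslash-n, so continuation lines are easy to spot. A tiny check on a made-up miniature of the dump (the rows below are invented for illustration):

```python
# Made-up miniature of the dump: two rows, the second post's body wraps
sample_rows = [
    '850\t2020-09-01 10:00:00\tA short post\n',
    '851\t2020-09-21 06:33:46\tAutomate downloading songs\n',
    '\\n## Getting song charts\n',   # literal backslash-n: continuation line
    '\\nmore body text\n',
]

# A row belongs to the previous post exactly when it starts with a literal "\n"
continuations = [row for row in sample_rows if row.startswith('\\n')]
print(len(continuations))  # 2
```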

Generating posts from dump

After having the export in hand, the rest was quite simple. I just needed to map the columns to attributes (for example, column 6, counting from zero, is the title) and generate a file from each post's content.

Note that because posts span several lines, we can't just assume one post per line; instead, lines are "stitched" together until we recognize the beginning of a new post.

# Read dump & stitch continuation lines back onto their post
raw_posts = []
raw_post = ''
with open('dump.tsv', 'r') as dump_file:
    next(dump_file)  # skip the column-header row
    for row in dump_file:
        if row.startswith('\\n'):
            # body continuation: a literal "\n" marks a wrapped line
            raw_post += row
        else:
            if raw_post:
                raw_posts.append(raw_post)
            raw_post = row
    if raw_post:
        raw_posts.append(raw_post)  # don't drop the last post
print('Read', len(raw_posts), 'raw posts.')

# Build markdown posts
posts = []
for raw_post in raw_posts:
    row = raw_post.split('\t')
    posted = row[1]
    title = row[6]
    # un-escape the mysql batch-mode output
    body = row[8].replace('\\n', '').replace('\\t', '  ')
    category1 = row[13]
    category2 = row[14]
    status = row[18]
    section = row[21]
    keywords = row[23]
    url = row[25]
    content = '---\ntitle: ' + title + "\n"
    content += 'date: ' + posted + "\n"
    content += 'category: ' + category1 + "\n"
    content += 'keywords: ' + keywords + "\n"
    if status not in ('4', '5'):
        # Textpattern status: 4 = live, 5 = sticky; everything else is a draft
        content += 'draft: true\n'
    content += '---\n\n'
    content += body
    filename = posted.split(' ')[0] + '_' + url + '.md'
    posts.append({'filename': filename, 'content': content})

# Write files
for post in posts:
    with open('posts/' + post['filename'], 'w') as post_file:
        post_file.write(post['content'])

print('Finished, wrote', len(posts), 'posts.')

I still had to do some manual cleanup afterwards (including converting Textile hyperlinks to Markdown), but a basic export with one file per post still saved a lot of time.
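For the Textile-to-Markdown link cleanup, a regex gets most of the way there. A sketch covering the common "text":url form; edge cases such as trailing punctuation directly after the URL still need a manual pass:

```python
import re

def textile_links_to_markdown(text):
    # Rewrite Textile links ("label":http://...) as Markdown ([label](http://...))
    return re.sub(r'"([^"]+)":(https?://\S+)', r'[\1](\2)', text)

print(textile_links_to_markdown('See "my site":https://kleemans.ch for more.'))
```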

Thanks for reading!