So, I’ve spent the better part of the past couple days trying to import legacy data into our brand new spiffy models.
And trust me, shoehorning 5-6 years’ worth of inconsistent data into new models is not fun. I don’t begrudge those in the data mining industry the bundles of cash they must make—this is one tedious PITA task.
The worst thing about a one-off migration is that you don’t learn that much from it. True, I made mistakes and learned a few principles from them that I can reuse if I ever suffer through a similar task, but for the most part it is very domain-specific knowledge that I will not get to practice again.
One part of this arduous process was converting from HTML to plaintext. Straightforward enough. The previous article bodies contained a lot of entities like our good friend &nbsp;, so right off the bat your standard naive regular expression stripping is not feasible [1]. BeautifulSoup would have worked, but it kept segfaulting on me. So my next attempt was to roll my own:
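# Python 2 only: htmllib does not exist in Python 3.
import htmllib
from formatter import AbstractFormatter, DumbWriter
from StringIO import StringIO

def html2plaintext(html):
    out = StringIO()
    # DumbWriter wraps its output at 200 columns
    formatter = AbstractFormatter(DumbWriter(out, 200))
    parser = htmllib.HTMLParser(formatter)
    parser.feed(html)
    parser.close()
    return out.getvalue()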
Now this is very cleaned up; prior to this I was using a file object instead of StringIO [2]. But it works, with a little mucking around with decode() and encode(). The result, however, was not ideal, since what was originally formatted decently with HTML turned into an unattractive chunk of plaintext [3].
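A side note for anyone on Python 3, where htmllib is gone: roughly the same idea can be sketched with html.parser from the standard library. This is only a minimal sketch, not the code I actually ran. The names PlainTextExtractor and html_to_plaintext are made up for illustration, and it assumes you only care about turning <p> and <br> into line breaks and decoding entities such as &nbsp;.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Hypothetical helper: collects text and turns <p>/<br> into line breaks."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &nbsp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plaintext(html):
    """Return the text content of an HTML fragment; all other markup is dropped."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Like the htmllib version, this throws away every bit of formatting except the line breaks, so it has the same "unattractive chunk of plaintext" problem.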
I’ve saved the real gem of this entire exercise till the end: a little library with the highly original name html2text. It has the benefit of converting not just to plain text, but to valid Markdown as well. Perfect for our purposes, since we use Markdown with django.contrib.markup [4]. Oh, and did I mention it only took one line? (My code, not the html2text code; this is Python, not Perl.)
This time round I had results I was completely satisfied with.
This episode only served to remind me how rich Python’s libraries are. For my task I had the option of building off Python’s standard libraries, or deciding between external libraries such as the more general BeautifulSoup and the lesser-known library that satisfied all my task requirements – html2text. Python users are seriously spoilt for choice, and that’s what makes Python a joy to work with.
1 Indeed it is almost never feasible, due to the irregularity of most HTML you find on the web. Cue gratuitous usage of jwz’s famous quote: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
2 When I write one-off code, it sometimes gets ugly enough to make the Daily WTF because I become a lot less concerned about maintainability in the name of Getting Things Done (I suppose this phrase is trademarked by Spolsky?)
3 The <p>s and <br>s were turned into newlines of course, but the rest of the formatting was lost.
4 Unfortunately this nomenclature is confusing the first couple times you run into it. You have to {% load markup %} to use the {{ content|markdown }} filter. But I understand where it comes from—the markup app covers not only Markdown but Textile and ReST (reST? rEsT? 4357?)

