Shared Del.icio.us Accounts with Python

Posted by timgoh
on Wednesday, October 31

syncdelish is a little script I wrote that takes links in a del.icio.us user’s ‘links for you’ section and adds them to his bookmarks.

Basically, if you run this script from cron, it automatically bookmarks any links that other users have tagged with “for:” plus your username. You can use this to provide link-sharing via a single communal account for a group of people, or simply to auto-add links people send your way.

To use this for a shared account, just create a new account, e.g. “deli”. Have everyone add this account to their del.icio.us network, and then just tag links you want to share with “for:deli”. Those links will be added to deli’s bookmarks whenever the script is run, so you will be able to access them easily [1].

Script and usage

Download the script here (BSD licensed [2])

Rename as you like and run with Python. Usage is:
syncdelish USERNAME PASSWORD FOR_YOU_FEED

Where FOR_YOU_FEED is the link to RSS at the bottom of the ‘links for you’ page. It looks something like ‘http://del.icio.us/rss/for/username?private=[longhash]’.

Known Issues

‘Links for you’ seems to be an incomplete del.icio.us feature. There are limitations with it as follows:

  • these links cannot be deleted by you
  • even if the original user who tagged it removes the for: tag, it still is not removed
  • there is no differentiation in the private feed for saved links and unsaved links
  • the private ‘links for you’ feed is protected via security by obscurity – no authentication needed, just relies on people not guessing the correct hashcode.

Also, the del.icio.us posting API has only two return codes: success and failure. There is no notification of “failure because duplicate”. Hence when a previously saved link is encountered, the script wrongly outputs failure. It doesn’t break, it just mistakenly tells you that this link could not be added instead of saying it already existed.

Basically, it currently seems that unless you are a del.icio.us developer you have no way of removing something tagged as for you. This leads to an entirely more serious problem…

It means my script gets slower every time a link is added, since the list of links to copy only grows and never shrinks. Once the number of shared links gets to a reasonable size, the script is all but unusable, since my script does not know which is a duplicate.

There are a few ways I can resolve this:

  1. Scrape from HTML instead of XML (the link after toggles between ‘saved’ and ‘save this’). Rejected because that’s not the right way to do it.
  2. Log processed links in a file and skip over them if encountered. Rejected because one of my chosen constraints for this was implementing it with a single script.
  3. Take a ‘last_time_run’ or ‘frequency’ type optional parameter that skips all links added before the time given. This is still not ideal because the full list of for: links still has to be downloaded and processed. In other words it’s a hack. So, rejected for now, but if the status quo remains I will add in this feature.
  4. Whine to del.icio.us feedback about the limitations with ‘for you’ links, and then write a blog post whining some more.

As you already know, I picked option 4 for now. Do check back on this page for any updates.
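For the curious, option 3 could look roughly like the sketch below. It assumes a simplified RSS layout with item, link and pubDate elements, which may not match what the real del.icio.us feed looks like; SAMPLE_FEED and links_added_since are made-up names for illustration and are not part of syncdelish.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical feed contents; the real 'links for you' feed may use
# different element names or namespaces.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item>
    <link>http://example.com/old</link>
    <pubDate>Mon, 01 Oct 2007 12:00:00 +0000</pubDate>
  </item>
  <item>
    <link>http://example.com/new</link>
    <pubDate>Tue, 30 Oct 2007 12:00:00 +0000</pubDate>
  </item>
</channel></rss>"""


def links_added_since(feed_xml, last_run):
    """Return the links whose pubDate is later than last_run."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        added = parsedate_to_datetime(item.findtext("pubDate"))
        if added > last_run:
            fresh.append(item.findtext("link"))
    return fresh


last_run = datetime(2007, 10, 15, tzinfo=timezone.utc)
print(links_added_since(SAMPLE_FEED, last_run))
```

Note that the whole feed still has to be downloaded and parsed before anything can be skipped, which is exactly why this counts as a hack.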

Background and Credits

Thanks to Simon Willison I came across Natalie Downe’s Snafflr script. I liked the idea, but I wanted to implement my own version using del.icio.us’s native “for:” tag.


1 Links tagged with “for:” can only be accessed by the user account they are shared with.

2 Only reason I chose BSD over MIT is that the former has the name of my alma mater while the latter was too expensive for me to attend.

Using database views with Django models

Posted by timgoh
on Wednesday, October 17

I’m surprised this feature isn’t promoted more (maybe because it involves raw SQL?). You can hook a Django model to a database view. It can come in very useful when you’re trying to aggregate common information from various models.

Problem to solve

Say I have the following models: Album, Artist, and Dvd on a music site that accepts submissions. Before submissions are shown on the site they must be approved. So each of them has a boolean saying whether they are approved or not, something like:

approved = models.BooleanField(default=False)

Let’s complicate the situation a little to make for a better example. Say the equivalent field for the Dvd model is called “not_unapproved” [1]. Also, Albums and Dvds have a foreign key to record label [2], something like:

record_label = models.ForeignKey(EvilRIAAMember)

Our goal is to have a model called Unapproved which aggregates the Album, Artist, Dvd objects that have not been approved. It should provide access to the ‘approved’ and ‘record_label’ fields. Then I can placate my site editors who are clamoring for an easy way to find all unapproved content regardless of content type.

Step 1: View definition

UNAPPROVED_VIEW = """ 
CREATE OR REPLACE VIEW unapproved_view AS
  SELECT
    (SELECT nextval('unapproved_sequence')) as id,
    o.id AS object_id,
    ct.id AS content_type_id,
    o.approved AS approved,
    o.record_label_id AS record_label_id
  FROM
    album as o,
    (SELECT id FROM django_content_type where model='album') as ct
  WHERE
    NOT o.approved
UNION
  SELECT
    (SELECT nextval('unapproved_sequence')) as id,
    o.id AS object_id,
    ct.id AS content_type_id,
    o.not_unapproved AS approved,
    o.record_label_id AS record_label_id
  FROM
    dvd as o,
    (SELECT id FROM django_content_type where model='dvd') as ct
  WHERE
    NOT o.not_unapproved
UNION
  SELECT
    (SELECT nextval('unapproved_sequence')) as id,
    o.id AS object_id,
    ct.id AS content_type_id,
    o.approved AS approved,
    NULL AS record_label_id
  FROM
    artist as o,
    (SELECT id FROM django_content_type where model='artist') AS ct
  WHERE
    NOT o.approved
""" 

Put that definition in app/models/unapproved.py, which is where the model will be going as well.

Those familiar with db views should probably skip on ahead to the next section.

Ok, what this gives you is a view called ‘unapproved_view’ which has the following fields:

  • id
  • object_id
  • content_type_id
  • approved
  • record_label_id

Since artists do not have a record label foreign key, we specifically NULL it out. Note that we are only selecting rows that are not approved. If you’re trying this out with your own models, remember that for a UNION to work you need to have matching types.

The “nextval(‘unapproved_sequence’)” part is just to give each row in the unapproved_view a unique ID.

You may be tempted to optimize by replacing (SELECT id FROM django_content_type where model='artist') AS ct with a hard-coded content_type_id value, but that is much less safe. I’d rather have the overhead of those extra queries and not have to worry about content type ids changing on me.
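If you want to poke at the idea without a PostgreSQL install, here is a minimal sketch using Python’s built-in sqlite3 module. It is a simplification and not the view from above: SQLite has no sequences, so the id column is left out, and a literal model name stands in for the django_content_type lookup. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE album  (id INTEGER PRIMARY KEY, approved INTEGER, record_label_id INTEGER);
CREATE TABLE dvd    (id INTEGER PRIMARY KEY, not_unapproved INTEGER, record_label_id INTEGER);
CREATE TABLE artist (id INTEGER PRIMARY KEY, approved INTEGER);

INSERT INTO album  VALUES (1, 0, 10), (2, 1, 10);
INSERT INTO dvd    VALUES (1, 0, 20);
INSERT INTO artist VALUES (1, 1), (2, 0);

-- Every SELECT in the UNION must return the same number of columns,
-- so artists get an explicit NULL for record_label_id.
CREATE VIEW unapproved_view AS
  SELECT 'album' AS model, id AS object_id, approved, record_label_id
    FROM album WHERE NOT approved
  UNION ALL
  SELECT 'dvd', id, not_unapproved, record_label_id
    FROM dvd WHERE NOT not_unapproved
  UNION ALL
  SELECT 'artist', id, approved, NULL
    FROM artist WHERE NOT approved;
""")

rows = conn.execute(
    "SELECT model, object_id, record_label_id "
    "FROM unapproved_view ORDER BY model, object_id"
).fetchall()
for row in rows:
    print(row)
```

Only the three unapproved rows come back, one per table, with the artist row carrying no record label.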

Step 2: Model File

In app/models/unapproved.py:

  class Unapproved(models.Model):
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey()
    approved = models.BooleanField(default=False)
    record_label = models.ForeignKey(EvilRIAAMember, null=True)

    class Meta:
      db_table = "unapproved_view" 

This isn’t the whole file of course, fill in the rest: imports, app_label, etc.

Things to note:
  • this model has “null=True” for the record_label definition to cater for Artists not having one.
  • 0.96 and below should use models.GenericForeignKey instead of generic.GenericForeignKey

Anyway you now have a model file hooked to the view. You’ll have to create that view in your database first, of course (run the contents of the UNAPPROVED_VIEW variable from Step 1). Don’t forget about creating the sequence:

CREATE SEQUENCE unapproved_sequence START 1;

Step 3: Use the darn thing already

Now if you add the necessary “class Admin” boilerplate to the model, you’ll find that you can access this model in the Django admin. While you can’t view individual objects and they’re read-only, you do get a handy list view.

You can use this model in (Django) views and templates though. You can now do things like Unapproved.objects.filter(...). And with the GenericForeignKey you have easy access to the actual content object. Writing a view that displays all unapproved objects and lets a user change the “approved” value to True is trivial once you have this set up, so I won’t get into the details of that.

Hope the above explanation helped. This is a very useful technique since it allows you to practice DRY and handle common properties and behavior across your models in a single place.


1 Turns out the coding for Dvd model was outsourced to someone whose native language favors double negatives.

2 Yes, wise guys, in real life artists have record labels too, but since these can possibly change over time, for the purposes of this demo they don’t!

Inside Django's Template Inheritance

Posted by timgoh
on Thursday, October 11

Users and abusers of Django’s nifty template inheritance may have found they are unable to do the following in a child template (one that uses the ‘extends’ templatetag):

{% block stuff %}
  {% if cond %}
    {% block foo %} override foo {% endblock %}
  {% else %}
    {% block bar %} override bar {% endblock %}
  {% endif %}
{% endblock %}

What ends up happening is that both ‘foo’ and ‘bar’ blocks in the base template are overridden.

If you look at the Django source for the ‘block’ and ‘extends’ tags it is easy to see why this doesn’t work.

The render() function of the ExtendsNode class rips out every single block inside it and propagates it “up” to the base template by recursively replacing blocks in parent templates until the base template is reached. This takes place before the contents of block nodes are rendered.

The main consequences of this:

  1. Block nodes exist outside of flow control, as the snippet above shows
  2. Template rendering is sped up. All non-extends tags are rendered only once – when the base template is rendered.
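To make the first point concrete, here is a toy model of the block-collection step. This is not Django’s actual code; Text, Block, If and collect_blocks are invented for illustration. The point is only that blocks are gathered by walking the node tree before anything is rendered, so the condition on the if node is never consulted.

```python
class Text:
    def __init__(self, text):
        self.text = text


class Block:
    def __init__(self, name, children):
        self.name = name
        self.children = children


class If:
    def __init__(self, condition, true_children, false_children):
        self.condition = condition
        self.true_children = true_children
        self.false_children = false_children


def collect_blocks(nodes):
    """Gather every block in the tree, without evaluating any conditions."""
    found = {}
    for node in nodes:
        if isinstance(node, Block):
            found[node.name] = node
            found.update(collect_blocks(node.children))
        elif isinstance(node, If):
            # Both branches are walked: nothing has been rendered yet,
            # so there is no context in which to test the condition.
            found.update(collect_blocks(node.true_children))
            found.update(collect_blocks(node.false_children))
    return found


# The child template from the snippet above, as a node tree.
child_template = [
    Block("stuff", [
        If("cond",
           [Block("foo", [Text("override foo")])],
           [Block("bar", [Text("override bar")])]),
    ]),
]

print(sorted(collect_blocks(child_template)))
```

Both foo and bar end up in the set of overrides, whatever cond turns out to be at render time.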

Assume we really want to support the snippet I posted. This would result in a massively complex implementation, and sentences like

“Since template inheritance is somewhat similar to class inheritance, we need to implement dynamic dispatch of blocks via virtual function table. To populate the table…”

in the Django mailing list.

I think the current decision is the correct choice. It’s better to have a lightweight rendering engine and put the onus on the developer to get it right.

Here’s another fun thing to try before I go:

base template:
{% block outer %}
  I'm the outer block 
  {% block inner %}And I'm inside!{% endblock %}
{% endblock %}
child template:
{% block inner %}
  I am so drunk that
  {% block outer %}
    I reversed the blocks
  {% endblock %}
{% endblock %}

Try it!

Would you want the template rendering engine to include code to catch this and any of the myriad other gotchas compiler writers love so much? And have all that error-checking code chugging away in the background with every page you render? I think not.

Dreaming in Code, at last

Posted by timgoh
on Monday, April 30

So this is my obligatory “Dreaming In Code – been there read that” post. Sure it’s a little late (it seemed almost like a meme on Planet Python, swap “5 things that people don’t know about you” with “my opinions on DIC”). But I did get around to it, and I have rather mixed feelings upon finishing.

Ok, we all know from the expression what opinions are like, so excuse me while I show you mine.

The Good

  1. It’s fun to read about a team using Python. The book mentions Twisted, Zope, and other Python libraries and luminaries so much so that after a while it’s almost like celebrity spotting. “Hey I know those guys!”
  2. It’s reassuring in a rather schadenfreude-ish way to find out even the good programmers can fail. Software IS hard. I loved the bit about their issues with calendars and allowing users to input recurring events, because we have the same problems at my work projects. We do have ways to tackle those, but they become a hassle to work with at times. Good to know we’re not alone.
  3. There is a lot of dumbing down of technical details that most developers will know anyway. No I’m not making a mistake – this goes in the good column – and here’s why: many developers can’t document or explain themselves well. You can learn from Rosenberg’s writing. Instead of glossing over those sections, read them, but not to learn about X technical detail you already know. Read them as a good example of how to express X in simple terms.

The Bad

  1. If a book had to be written about a late, overbudget project, did the language used have to be Python? More on this later.
  2. Most developers won’t learn too much from DIC. Those who have read the “right” books/blogs will not gain any new insights. If you treat this as a book to read to improve yourself as a developer, you may be disappointed. Treating the book as a leisurely read would be much better.

In conclusion

DO read this book. It’s the kind of light fluffy writing you can comfortably fit between the technical stuff you normally read, especially if it’s books like “3D Graphics Programming for Math Savants” or “Teach Yourself Embedded Linux Systems in 21 days”.

DO recommend this book to a loved one so that they can have a better idea of what you do. Just don’t try to give it to them on a special day – that will not work out. “Hi honey, happy Valentine’s! I got you a book about software development that will help you understand me better…”

DON’T let your boss get ahold of this book if he’s a PHB and there are projects at your workplace crying out for a healthy dose of Python. He’ll skim through the book, and for the rest of his life associate Python with a failed project.

Fun with Feisty and Python 2.5

Posted by timgoh
on Saturday, April 14

EDIT (May 18, 2007):

Ok I noticed from my awstats that a decent percentage of my traffic comes from looking for this problem, some through Google and others through a thread from the Ubuntu forums. I did not expect my quick hack to be used by so many others (honestly at the time I didn’t even know if it would work for me in the long term), but hey whatever works.

My original post was written as a “hey look what I found” and not as a “here’s how you fix this problem”, so I’m posting a much more succinct version. (The past two paragraphs don’t count!)

Who this fix is for: anyone who needs Python 2.4 as the default Python version in Feisty.

Those who are familiar with Linux may have already tried symlinking /usr/bin/python to /usr/bin/python2.4, only to get this error message:

ValueError: the symlink /usr/bin/python does not point to the python default version. it must be reset to point to python2.5

Here’s how to downgrade your default version in Feisty to Python 2.4 without those errors. Be very aware of my disclaimer at the end of the post before you follow any of this.

1. If you haven’t already, symlink /usr/bin/python to /usr/bin/python2.4

sudo ln -sf /usr/bin/python2.4 /usr/bin/python

2. Edit your /usr/share/python/debian_defaults file

sudo vim /usr/share/python/debian_defaults

The initial few lines should read:

[DEFAULT]
# the default python version
default-version = python2.5

3. Change the third line to read:

default-version = python2.4

That’s all.

Once again, please read the disclaimer at the end of this post before you try this. I have been developing in Python and running Python apps on my system for a month with this fix, and have not encountered any problems yet.

Now, a favor to ask. If this works for you, please leave a comment saying it did. If it doesn’t, please leave a comment saying so with the appropriate error messages.

This way future people who are directed here can be more (or less) confident about this hack. Wouldn’t this post have been more valuable to you when you came if there were a bunch of affirmations and fixes for certain situations?

Call me a comment whore if you like, but comments here would benefit everyone who comes here looking for a solution. Or comment on the Ubuntu official forums thread.

Original post follows…

Previous Post:

(I would subtitle this “The journey is worth more than the destination”)

So Feisty comes with Python 2.5.

Now that’s not a problem right? Just symlink your /usr/bin/python2.4 and switch when necessary.

Except that I would be switching far too often: “sudo apt-get upgrade” had dpkg crapping out on me saying that python was not symlinked to python2.5. Since my current projects are not 2.5-ready, I would have to swap back and forth a lot. This is particularly annoying since the problem package in my case, “update-manager-core”, does not require 2.5 to work.

I tried setting up update-alternatives to ease this a little:

sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.4 2
sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.5 1
sudo update-alternatives --config python

This method does not save that many keystrokes from manually symlinking as necessary, but it’s easier and has less room for error I suppose.

However, this does not work. update-alternatives will symlink /usr/bin/python to /etc/alternatives/python (which is symlinked to the appropriate python version in /usr/bin), which will cause /usr/share/pycentral-data/pyversions.py to choke (it is hardcoded to check for a symlink in the /usr/bin directory [1]).

/usr/share/pycentral-data/pyversions.py lines 129-130:

if not _default_version in (debian_default, os.path.join('/usr/bin', debian_default)):
    raise ValueError, "the symlink /usr/bin/python does not point to the python default version. It must be reset to point to %s" % debian_default
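The condition in that check is easy to play with on its own. The sketch below pulls it into a helper I have named symlink_ok (not a function in pyversions.py), taking the symlink target as a plain string instead of reading it from disk. It uses posixpath.join rather than os.path.join so it behaves the same on any platform.

```python
import posixpath


def symlink_ok(link_target, debian_default):
    """Mirror the membership test quoted above from pyversions.py."""
    return link_target in (
        debian_default,
        posixpath.join("/usr/bin", debian_default),
    )


# A direct symlink to the default version passes.
print(symlink_ok("/usr/bin/python2.5", "python2.5"))
# The indirection that update-alternatives introduces does not.
print(symlink_ok("/etc/alternatives/python", "python2.5"))
# With the default switched to python2.4, a python2.4 symlink passes.
print(symlink_ok("/usr/bin/python2.4", "python2.4"))
```

Nothing in the test follows the chain of symlinks, so /etc/alternatives/python fails no matter what it ultimately points to.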

While poking around here, I noticed that the default version is set in /usr/share/python/debian_defaults.

relevant snippets from /usr/share/pycentral-data/pyversions.py:

config.readfp(file('/usr/share/python/debian_defaults'))
_defaults = config
value = _defaults.get('DEFAULT', name)
### name == 'default-version' in this circumstance

So I merrily made my way to the debian_defaults file and set the default version to python2.4. Double-checked with Debian Python Policy to make sure I wasn’t doing anything catastrophic. The relevant details are in section 1.3.1, and from my understanding of it, the change I made is safe.
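If you want to see which version a debian_defaults-style file declares, the lookup is small. This is a sketch only: it uses the Python 3 configparser module (the snippets above come from Python 2 era code), and it parses an inline sample string instead of the real file. default_python_version is my own helper name.

```python
import configparser

# Sample contents modelled on the first lines of debian_defaults,
# after the edit described in this post.
SAMPLE = """\
[DEFAULT]
# the default python version
default-version = python2.4
"""


def default_python_version(text):
    """Return the default-version value from debian_defaults-style text."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return config.get("DEFAULT", "default-version")


print(default_python_version(SAMPLE))
```

To check a real system you would pass in the contents of /usr/share/python/debian_defaults instead of SAMPLE.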

I verified this change against my original problem – dpkg throwing an error when upgrading/installing/removing “update-manager-core” with /usr/bin/python symlinked to /usr/bin/python2.4. Problem solved without having to switch the symlink to python2.5. So this change has fixed a bigger problem than I was originally trying to solve – I now cut down the amount of symlink switching instead of making switching more convenient!

This is a huge hack though. I wouldn’t recommend it. For one thing, my “python” package (like any default Feisty install) is for Python 2.5, which conflicts with my manual change to debian_defaults. I do intend to keep this change for now but revert the moment I notice any strange behavior.


1 That is a bug for some other time—surely that code needs to include the possibility of update-alternatives being used. It’s not like that’s some obscure corner case.

Django: When save() is not safe

Posted by timgoh
on Friday, March 02

I’m looking forward to a stable version of newforms in the next Django release. The old forms and manipulators system, while decent, was the most painful part of Django to use—it seemed out of place amongst the rest of the framework.

Recently I ran into a situation which is not that out of the ordinary, but which the current manipulator-based admin makes very difficult to handle.

Background: A “Book” has a main_category field and a categories field, with the former providing the category slug used in the url. The main_category field is a foreign key to the Category model, and categories has a many-to-many relationship with Category.

Objective: In order to make searching one line of code, copy the category chosen in main_category to categories upon save. That simplifies queries for Books based on category to a single books = desired_category.book_set.all() call. (Without this copy I would have to OR this with Book.objects.filter(main_category=desired_category).)

So, this sounds pretty straightforward to those with any experience in Django: modify the save() function in Book to add whatever is in main_category to categories as follows:


  def save(self):
      # save first so that a brand-new Book has a primary key for the join table
      super(Book, self).save()
      self.categories.add(self.main_category)

This works perfectly in ipython.

However, when I tested this through Django’s admin interface, it did not work upon save, i.e. main_category was not copied over.

This is because saving in admin goes through the AutomaticManipulator. Ample pdb usage found the code in question in db/models/manipulators.py. I’ll paste the relevant extracts here, comments enclosed in ### are my annotations.



def save(self, new_data):
    ### [ snipped ] ###
    # First, save the basic object itself.
    new_object = self.model(**params)
    new_object.save()
    ### post_save hook kicks in at the end of save() above ###

    ### [ snipped ] ###
    # Save many-to-many objects. Example: Set sites for a poll.
    for f in self.opts.many_to_many:
        if self.follow.get(f.name, None):
            if not f.rel.edit_inline:
                if f.rel.raw_id_admin:
                    new_vals = new_data.get(f.name, ())
                else:
                    new_vals = new_data.getlist(f.name)
                # First, clear the existing values.
                rel_manager = getattr(new_object, f.name)
                ### Sacre bleu! what is this? ###
                rel_manager.clear()
                # Then, set the new values.
                for n in new_vals:
                    rel_manager.add(f.rel.to._default_manager.get(pk=n))


Ok, here’s what happens.

  1. My overridden save method gets called in new_object.save()
  2. rel_manager deletes everything that used to be in categories with rel_manager.clear() (found in db/models/fields/related.py)

The post_save hook does not help here, since it kicks in at the end of new_object.save().

Essentially, the code for saving many-to-many objects works by deleting what is already there and adding whatever was included in the form. All my overriding of the save() function in Book managed to do was provide one more category to be deleted.
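The ordering problem can be reproduced without Django at all. In this toy sketch a plain set stands in for the many-to-many relation; FakeBook and manipulator_save are invented names that only mimic the sequence of calls in the extract above.

```python
class FakeBook:
    """Stand-in for the Book model, with a set instead of a real relation."""

    def __init__(self, main_category):
        self.main_category = main_category
        self.categories = set()

    def save(self):
        # The overridden save(): copy main_category into categories.
        self.categories.add(self.main_category)


def manipulator_save(book, form_categories):
    """Mimic the order of operations in the manipulator's save()."""
    book.save()                # 1. the overridden save() runs here
    book.categories.clear()    # 2. existing many-to-many values are wiped
    for category in form_categories:
        book.categories.add(category)  # 3. only the form's values return
    return book


book = manipulator_save(FakeBook("fiction"), ["history", "travel"])
print(sorted(book.categories))
```

The category added in save() is gone by the time the function returns, unless it also happened to be selected in the form.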

Now I’m not saying this is the wrong thing to do—from the admin’s point of view, the correct thing to do is ignore whatever was in categories before, and only look at what has been selected in the form field. It is the right thing to do within the current application structure, imo.

However, when that application structure limits me from doing something that should be fairly straightforward (adding to a many2many field programmatically in the admin), then maybe there is some room for improvement.

I haven’t had the time to take a comprehensive look at newforms yet, but my brief skimming of related posts leads me to think that it will eventually be a change of nearly the same magnitude as magic-removal. Just as I am glad I only came across Django post magic-removal, I’m sure future users of Django will be glad they didn’t have to deal with oldforms; such is the improvement newforms will provide. Those interested in getting a headstart can find an excellent intro to newforms here.

Managing Django's Managers

Posted by timgoh
on Thursday, February 15

Managers in Django are perhaps a misnomer, because unlike the stereotypical PHB they are flexible and provide a very efficient way of doing things.

But while Managers are very powerful, they can cause problems if you aren’t familiar with the way they work behind the scenes. Hopefully this post will help you avoid a problem I ran into.

Some time ago our client wanted to change the way a model worked—they wanted certain objects to be invisible on the front end but still remain present on the back-end admin app.

Using a custom manager, you can accomplish that without changing the hundreds of ORM calls you have in your code. In this situation, say we want to filter out all Albums that aren’t in stock; they should never be shown in the front end. Now there are probably lots of queries like this all over your code [1]:
Album.objects.filter(category__title="Metal")

Now, going through all such calls and adding an “in_stock=True” to every filter would be a very painful task. It can be automated with sed, but it causes a lot of unnecessary repetition.

So you can just define a custom Manager:

class AlbumInStockManager(models.Manager):
    def get_query_set(self):
        return super(AlbumInStockManager, self).get_query_set().filter(
            in_stock=True)

And use that manager in your model:


class Album(models.Model):

    objects = AlbumInStockManager()

And voila! All queries using Album.objects will now only return albums where in_stock is true.

However, this has a side effect [2]:

If you use custom Manager objects, take note that the first Manager Django encounters (in order by which they’re defined in the model) has a special status. Django interprets the first Manager defined in a class as the “default” Manager. Certain operations – such as Django’s admin site – use the default Manager to obtain lists of objects, so it’s generally a good idea for the first Manager to be relatively unfiltered.

Actually, make that “completely unfiltered”. The reason is explained in Ticket 1855, “Using a custom default manager can lead to un-editable objects in admin”:

This is because the admin change_stage view uses the automatic ChangeManipulator for the model, which in turn uses the model’s default manager to fetch the object to change. So if the model’s default manager happens to filter in a way which excludes that object from the returned QuerySet, the ChangeManipulator will raise ObjectDoesNotExist, which in turn causes the admin to return a 404.

Our custom manager has overridden the default manager and, because of this bug, breaks part of the admin’s functionality.

So the simple solution here is to keep “objects” as the default manager, and use a different manager for the front-end Album queries:


objects = models.Manager() # default manager
in_stock = AlbumInStockManager() # custom manager

However, this means that all previous Album.objects calls in the front-end would have to be changed to Album.in_stock calls.

So, how do you have your cake and eat it too?

The correct solution for our requirements (keep Admin untouched and use the name “objects” for our custom manager):


    # define all_objects first
    # so it becomes the default used by admin
    all_objects = models.Manager() 

    objects = AlbumInStockManager() 

But we’re not out of the woods yet. Calls to Django helper functions like get_object_or_404 use the default manager too.

So, in the front-end where you only want in-stock Albums to be displayed, you have to change your calls:


    # WRONG - uses the default manager (all_objects = models.Manager())
    one_hit_wonder = get_object_or_404(Album,
        title="Who Let The Dogs Out")

    # RIGHT - uses objects = AlbumInStockManager()
    one_hit_wonder = get_object_or_404(Album.objects,
        title="Who Let The Dogs Out")

(This works as of revision 4275 which lets get_object_or_404 take Manager objects)

To recap

  1. The built-in admin app and various helper functions use the default manager
  2. The admin app will not allow editing of objects filtered out by the default manager
  3. This default manager will be:
    • the generic one if no managers are defined
    • the first defined manager if one or more managers are defined
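The “first one defined wins” rule can be mimicked in a few lines of plain Python. This is a toy and not how Django implements it: Manager, InStockManager and default_manager are stand-ins I made up, and the lookup relies on class attributes keeping their definition order (Python 3.6 and later).

```python
class Manager:
    """Toy unfiltered manager: returns every row."""

    def all(self, rows):
        return list(rows)


class InStockManager(Manager):
    """Toy filtered manager: hides rows that are not in stock."""

    def all(self, rows):
        return [row for row in rows if row["in_stock"]]


def default_manager(model_cls):
    """Return (name, manager) for the first Manager defined on the class."""
    for name, value in vars(model_cls).items():
        if isinstance(value, Manager):
            return name, value
    # No managers defined: fall back to a generic one.
    return "objects", Manager()


class Album:
    all_objects = Manager()     # defined first, so it is the default
    objects = InStockManager()  # what front-end queries use


ROWS = [
    {"title": "In stock", "in_stock": True},
    {"title": "Sold out", "in_stock": False},
]

name, manager = default_manager(Album)
print(name, len(manager.all(ROWS)), len(Album.objects.all(ROWS)))
```

Swap the two attribute lines in Album and the default becomes the filtered manager, which is the situation that hides objects from the admin.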

Now I do not really like the idea of a convention, documented or no, where the first defined manager automagically becomes the default manager. Hopefully in the future there is a much clearer way of specifying the default manager regardless of where it appears in the class definition.

For more information, Ticket 1855 contains a highly informative discussion.


1 “objects” is the default Django manager that gets included with every class. It yields querysets that perform no filtering, i.e. they return all rows.

2 from model documentation

Back to School

Posted by timgoh
on Friday, December 15

The Fall 2006 semester of Berkeley’s Programming Languages and Compilers class has ended, and the lucky guys who took it this semester got a nice up close and personal look at Python.

From the course description:

One goal of this course is to explore the structure of programming languages and to consider alternatives to familiar programming language features. We’ll also study the problem of translating programming languages into machine-executable forms, using Python as a concrete example of a language to be translated, and the assembly language of the Intel ia32 family (used in PCs and some of our Solaris workstations downstairs) as a concrete example of a target machine. We study language translation first to learn some of techniques used that are useful for many programming problems outside of language translation, second to gain a better intuitive feel for the tools we use when programming and the costs of the programs we write, and third (possibly most important) to gain experience with the engineering problems associated with building and validating a substantial piece of software.

Man, that is too cool, given how widely Python is used today. When I took this course, the language we studied and used for our projects was called COOL, which stands for Classroom Object-Oriented Language. Pretty decent language, but not exactly the language that, amongst other things, is the primary language of choice for YouTube. The jokes about Cool’s name being a misnomer started to wear thin a couple weeks into the semester.

I noted with interest that this incarnation of CS164 is taught by Prof Hilfinger, who has a reputation amongst computer science students at Cal as having very challenging courses and exams, which means many GPA-conscious students (the kind who pull out all stops to protect their honors GPA) would avoid classes he taught. I don’t know if this perception has changed. Would be a pity if anyone avoided this course primarily for this reason [1].

But hey, those of us who aren’t taking the course can benefit somewhat too, through the online lecture notes, which are just a “wget -r” away from the course site. The notes are mostly language-agnostic, as they should be! The pythonic part comes mainly from the projects, which are to implement phase by phase a compiler for Pyth, a dialect of Python. I don’t think I have the time to redo a project I did close to four years ago (albeit for a different language). But for the intrepid folk who do take a shot at it, it should be quite an experience!


1 I never had the privilege of taking a course under Prof Hilfinger myself, but I assure you this is more a case of coincidence due to my path of progression through the department and the courses he happened to be teaching each semester.

Handling Legacy Data

Posted by timgoh
on Sunday, December 03

So, I’ve spent the better part of the past couple days trying to import legacy data into our brand new spiffy models.

And trust me, shoehorning 5-6 years’ worth of inconsistent data into new models is not fun. I don’t begrudge those in the data mining industry the bundles of cash they must make—this is one tedious PITA task.

The worst thing about a one-off migration is that you don’t learn that much from it. True, I made mistakes and learned from them principles that I can reuse if I ever suffer through a similar task, but for the most part it is very domain-specific knowledge that I will not get to practice again.

One part of this arduous process was converting from HTML to plaintext. Straightforward enough. The previous article bodies contained a lot of entities like our good friend &nbsp;, so right off the bat your standard naive regular expression stripping is not feasible [1]. BeautifulSoup would have worked, but it kept segfaulting on me. So my next attempt was to roll my own:


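# Python 2 standard library modules (htmllib was removed in Python 3)
from StringIO import StringIO
from formatter import AbstractFormatter, DumbWriter
import htmllib

def html2plaintext(html):
    out = StringIO()
    # DumbWriter wraps plain text at 200 columns and writes it to `out`
    formatter = AbstractFormatter(DumbWriter(out, 200))
    parser = htmllib.HTMLParser(formatter)
    parser.feed(html)
    parser.close()
    return out.getvalue()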

Now this is very cleaned up; prior to this I was using a file object instead of StringIO [2]. But it works, with a little mucking around with decode() and encode(). This result, however, was not completely optimal, since what was originally formatted decently with HTML turned into an unattractive chunk of plaintext [3].
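On Python 3, where htmllib no longer exists, a rough equivalent can be built on html.parser from the standard library. This is a sketch and not the code I ran for the migration; TextExtractor and html_to_text are names chosen for illustration, and the handling of p and br tags is deliberately crude.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, turning <p> and <br> into newlines."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "br"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # &nbsp; decodes to U+00A0; swap it for an ordinary space.
    return "".join(parser.parts).replace("\xa0", " ").strip()


print(html_to_text("<p>Hello&nbsp;<b>world</b></p><p>Bye</p>"))
```

As with the version above, everything other than paragraph and line breaks is flattened, so bold text, links and lists lose their formatting.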

I’ve saved the real gem of this entire exercise till the end: a little library with the highly original name html2text. It has the benefit of not just converting to plain text, but to valid Markdown as well. Perfect for our purposes, since we use Markdown with django.contrib.markup [4]. Oh, and did I mention it only took one line? (My code, not the html2text code – this is Python, not Perl.)

This time round I had results I was completely satisfied with.

This episode only served to remind me how rich Python’s libraries are. For my task I had the option of building off Python’s standard libraries, or deciding between external libraries such as the more general BeautifulSoup and the lesser-known library that satisfied all my task requirements – html2text. Python users are seriously spoilt for choice, and that’s what makes Python a joy to work with.


1 Indeed it is almost never feasible, due to the irregularity of most HTML you find on the web. Cue gratuitous usage of jwz’s famous quote: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems. “

2 When I write one-off code, it sometimes gets ugly enough to make the Daily WTF because I become a lot less concerned about maintainability in the name of Getting Things Done (I suppose this phrase is trademarked by Spolsky?)

3 The <p>s and <br>s were turned into newlines of course, but the rest of the formatting was lost.

4 Unfortunately this nomenclature is confusing the first couple times you run into it. You have to {% load markup %} to use the {{ content|markdown }} filter. But I understand where it comes from—the markup app covers not only Markdown but Textile and ReST (reST? rEsT? 4357?)