xml.etree.ElementTree and OS randomness

OK, I’ll start by doing something you aren’t supposed to do: parse HTML using a bunch of regular expressions in Python. Don’t do this. Life is too short. Use something like BeautifulSoup. I couldn’t, because I had to use standard libraries only, so here I am.

Anyway, my program manipulated a tree of XHTML using xml.etree.ElementTree. Typical stuff, like you’d see in any tutorial on it:

import xml.etree.ElementTree as ET
...
root = ET.parse(htmlfname).getroot()
bod = root.find('body')
...

Then after a bunch of scraping, I used a regex like this to get the address in the href of the link:

anchor_addr = re.search("<a href=\"([^\"]*)\"",anchor_base).group(1)

(I’m converting HTML to Markdown. Don’t ask why.)

Anyway, this worked fine on Windows. I pushed the code, and a Unix machine in the CI/CD pipeline built it and failed on this line. The match would fail, so re.search returned None, and calling .group(1) on None raised an AttributeError.
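
A minimal sketch of that failure mode, and one way to guard against it. The sample strings and the anchor_address helper are made up for illustration; they are not the code from the original program.

```python
import re

# Hypothetical inputs: the same link, with the attributes in two different orders.
ordered = '<a href="foo" alt="hi">link</a>'
reordered = '<a alt="hi" href="foo">link</a>'

# Same pattern as above, written as a raw string to avoid the backslashes.
pattern = r'<a href="([^"]*)"'

def anchor_address(anchor_base):
    """Return the href value, or None when the pattern does not match."""
    match = re.search(pattern, anchor_base)
    if match is None:
        # re.search returns None on no match; calling .group(1) on it
        # is what raises the AttributeError.
        return None
    return match.group(1)

print(anchor_address(ordered))
print(anchor_address(reordered))
```

Checking for None doesn’t fix the pattern, but it turns a crash into something the caller can handle or log.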

At first, I thought it was the typical problem with linefeeds, like my code was splitting strings into arrays on \n and on Windows the text had \r\n. After messing around with that, I found it wasn’t the case.

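Here’s the problem: xml.etree.ElementTree uses a dictionary to store the attributes of an element it parses. Python dictionaries weren’t guaranteed to keep insertion order until Python 3.7; before that, ordering was an implementation detail. On top of that, ElementTree sorted attributes by name when writing an element back out, until Python 3.8 started preserving the original order. So it looks like the version of Python I was using on Windows kept the attributes in document order, but the ones I was using on my home Mac and on this Unix build machine were returning them alphabetically, most likely because they were older Pythons rather than because of the OS itself. So <a href="foo" alt="hi"> was becoming <a alt="hi" href="foo"> and breaking my regexp.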
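
One way to sidestep the ordering question entirely is to read the attribute from the parsed element instead of from its serialized text. This is a sketch, not the original program: the snippet string is a made-up stand-in for the real file, and it assumes the XHTML has no namespace declaration (with one, the tag names would carry a namespace prefix).

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment standing in for the real input file.
snippet = '<body><p><a href="foo" alt="hi">link</a></p></body>'
root = ET.fromstring(snippet)

# iter('a') walks every <a> element in the tree; get() looks up one
# attribute by name, so attribute order never comes into it.
hrefs = [a.get('href') for a in root.iter('a')]
print(hrefs)

# Serializing is where the order shows up. According to the ElementTree
# docs, tostring() preserves attribute order from Python 3.8 on, so the
# exact text printed here depends on the Python version running it.
serialized = ET.tostring(root.find('.//a'), encoding='unicode')
print(serialized)
```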

My code didn’t really need to find the entire element and pull the value of the attribute, because it was already inside the element. So I was able to change that regexp to "href=\"([^\"]*)\"" and that worked, provided the input never spelled the attribute in uppercase (HREF) or quoted the value with single quotes (href='foo').
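
If the uppercase and single-quote cases ever did show up, a slightly more tolerant pattern could look like the sketch below. HREF_RE and find_href are names invented for this example, and it is still a regex, so it will also match things like data-href or text inside comments.

```python
import re

# Case-insensitive, and accepts either quote style. The backreference \1
# requires the closing quote to match whichever quote opened the value.
HREF_RE = re.compile(r'''href\s*=\s*(["'])(.*?)\1''', re.IGNORECASE)

def find_href(text):
    """Return the first href value found in text, or None if there is none."""
    match = HREF_RE.search(text)
    return match.group(2) if match else None

print(find_href('<a alt="hi" href="foo">'))
print(find_href("<A HREF='foo'>"))
```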

Long story short, don’t use regular expressions to parse HTML.


Jonathan Konrath

Written by Jonathan Konrath, technical writer, docs manager, and publications engineer.