xml.etree.ElementTree and OS randomness

OK, I’ll start by doing something you aren’t supposed to do: parse HTML using a bunch of regular expressions in Python. Don’t do this. Life is too short. Use something like BeautifulSoup. I couldn’t, because I had to use standard libraries only, so here I am.

Anyway, my program manipulated a tree of XHTML using xml.etree.ElementTree. Typical stuff like any tutorial on it:

import xml.etree.ElementTree as ET
...
root = ET.parse(htmlfname).getroot()
bod = root.find('body')
...

Then after a bunch of scraping, I used a regex like this to get the address in the href of the link:

anchor_addr = re.search("<a href=\"([^\"]*)\"",anchor_base).group(1)

(I’m converting HTML to Markdown. Don’t ask why.)

Anyway, this worked fine on Windows. I pushed the code, and a Unix machine in the CICD pipeline built it, and failed on this line. The match would fail, and anchor_addr wouldn’t have a .group(1), because it was a NoneType.

At first, I thought it was the typical problem with linefeeds, like my code was splitting strings into arrays with \n and on Windows it had \r\n. After messing around with that, I found it it wasn’t the case.

Here’s the problem: xml.etree.ElementTree uses a dictionary to store the attributes of an element it parses. Python dictionaries are inherently unordered. Or they can be unordered; it’s an implementation detail. And it looks like the version of Python I was using on Windows was ordering them, but the ones I was using on my home Mac and on this unix build machine were returning the attributes alphabetically. So <a href="foo" alt="hi"> was becoming <a alt="hi" href="foo"> and breaking my regexp.

My code didn’t really need to find the entire element and pull the value of the attribute, because it was already inside the element. So I was able to change that regexp to "href=\"([^\"]*)\"" and that worked, provided it was never HREF or HREF='foo'.

Long story short, don’t use regular expressions to parse HTML.