xml.etree.ElementTree and OS randomness
OK, I’ll start by doing something you aren’t supposed to do: parse HTML using a bunch of regular expressions in Python. Don’t do this. Life is too short. Use something like BeautifulSoup. I couldn’t, because I had to use standard libraries only, so here I am.
Anyway, my program manipulated a tree of XHTML using xml.etree.ElementTree. Typical stuff like any tutorial on it:
import xml.etree.ElementTree as ET
...
root = ET.parse(htmlfname).getroot()
bod = root.find('body')
...
Then after a bunch of scraping, I used a regex like this to get the address in the href of the link:
anchor_addr = re.search("<a href=\"([^\"]*)\"",anchor_base).group(1)
(I’m converting HTML to Markdown. Don’t ask why.)
Anyway, this worked fine on Windows. I pushed the code, and a Unix machine in the CI/CD pipeline built it and failed on this line. The match would fail, so re.search returned None, and the result wouldn’t have a .group(1), because it was a NoneType.
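Here is roughly what that failure looks like in isolation. This is a minimal sketch: the extract_href helper and the sample strings are made up for illustration and are not from the real program.

```python
import re

# Same pattern as above: it only matches when href comes right after "<a ".
STRICT = re.compile("<a href=\"([^\"]*)\"")

def extract_href(anchor_base):
    """Hypothetical helper: return the href value, or None when nothing matches."""
    match = STRICT.search(anchor_base)
    if match is None:
        # Calling match.group(1) here would raise
        # AttributeError: 'NoneType' object has no attribute 'group'
        return None
    return match.group(1)

print(extract_href('<a href="foo">link</a>'))  # foo
print(extract_href('<p>no link here</p>'))     # None
```

Checking for None before calling .group(1) doesn't fix a bad pattern, but it turns a crash into something you can log and investigate.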
At first, I thought it was the typical problem with linefeeds, like my code was splitting strings into arrays with \n and on Windows it had \r\n. After messing around with that, I found it wasn’t the case.
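Here’s the problem: xml.etree.ElementTree uses a dictionary to store the attributes of an element it parses, and the order those attributes come back out in is not something you can count on. Python dictionaries were historically unordered (insertion order only became a language guarantee in Python 3.7), and before Python 3.8 ElementTree sorted attributes alphabetically when serializing an element. The version of Python I was using on Windows was keeping the attributes in source order, but the ones I was using on my home Mac and on this Unix build machine were returning them alphabetically, which probably means they were older Python versions rather than that the OS itself mattered. So <a href="foo" alt="hi"> was becoming <a alt="hi" href="foo"> and breaking my regexp.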
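To see the effect for yourself, something like the following should do it. It is a sketch that assumes the string being searched came from ET.tostring (the post doesn't say exactly how anchor_base was built), and the sample markup is invented.

```python
import re
import xml.etree.ElementTree as ET

elem = ET.fromstring('<a href="foo" alt="hi">link</a>')

# Serialize the element back to text. Which attribute comes first in the
# output depends on the Python version: sorted alphabetically before 3.8,
# source order from 3.8 on.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)

# The strict pattern only matches when href happens to come first.
strict = re.search("<a href=\"([^\"]*)\"", serialized)
print("strict pattern matched:", strict is not None)

# The parsed attributes are the same either way; only the text order differs.
print(elem.attrib == {"href": "foo", "alt": "hi"})  # True
```

Run it under two different Python versions and the first two lines of output can differ while the last one stays True.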
My code didn’t really need to find the entire element and pull the value of the attribute, because it was already inside the element. So I was able to change that regexp to "href=\"([^\"]*)\"" and that worked, provided it was never HREF or HREF='foo'.
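Two ways around those remaining caveats, both standard library only. This is a hedged sketch, not what my code actually does: HREF_RE and find_href are hypothetical names, and the sample fragments are invented.

```python
import re
import xml.etree.ElementTree as ET

# Option 1: ask the parsed element for the attribute directly.
# Attribute order in the serialized text no longer matters.
anchor = ET.fromstring('<a alt="hi" href="foo">link</a>')
print(anchor.get("href"))  # foo

# Option 2: a more forgiving regex: any letter case, single or double
# quotes, optional whitespace around "=". The backreference \1 makes the
# closing quote match the opening one.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)

def find_href(fragment):
    """Hypothetical helper: return the first href value in fragment, or None."""
    match = HREF_RE.search(fragment)
    return match.group(2) if match else None

print(find_href("<a HREF='foo'>"))           # foo
print(find_href('<a alt="hi" href="bar">'))  # bar
```

Option 1 is the safer one whenever you still have the element in hand. Option 2 is still a regex on markup, so it will happily match things like data-href too.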
Long story short, don’t use regular expressions to parse HTML.