How to make Python's ElementTree ignore lack of spaces between quotes and attributes?

Question

When I run

from xml.etree import ElementTree
tree = ElementTree.fromstring('<foo bar=""baz=""></foo>')

I get

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 11

This is due to the lack of space between "" and baz.
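A quick way to confirm that the missing space is the trigger is to parse both variants side by side. This is only a minimal check; `try_parse` is a hypothetical helper name introduced here, not part of ElementTree.

```python
from xml.etree import ElementTree


def try_parse(xml_text):
    """Return the root element's attributes, or the ParseError if parsing fails."""
    try:
        return ElementTree.fromstring(xml_text).attrib
    except ElementTree.ParseError as exc:
        return exc


with_space = try_parse('<foo bar="" baz=""></foo>')
without_space = try_parse('<foo bar=""baz=""></foo>')

print(with_space)              # {'bar': '', 'baz': ''}
print(type(without_space))     # the ParseError class
```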

I'm encountering this problem in XML files provided to me by a third party.

Is there any way to make ElementTree be a little less pedantic about the spacing and parse it as if there was a space?

It's worth noting that the solutions present in this similar question will ignore the error, but will not recover the baz attribute: https://stackoverflow.com/questions/13046240/parseerror-not-well-formed-invalid-token-using-celementtree — , Jul 01 '19 at 04:29
I don't think this is possible. XML is "pedantic" by design. What you have is not XML, and a conformant parser is correct in rejecting it. — mzjn, Jul 01 '19 at 06:37
If I just regex this problem, would I have 0 problems afterward? or 2? — user541686, Jul 01 '19 at 08:02

score 2 · Accepted Answer · answered Jul 01 '19 at 08:30

Since it sounds like a solution may not be within sight...

Until a better solution comes along, here's a hacky workaround for the next poor soul...

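import re

def xml_fixup(s):  # give it the XML as a string
    flags = re.DOTALL
    # A double- or single-quoted attribute value
    pat_quotes = '\"[^\"]*\"|\'[^\']*\''
    # Intended to match a quoted value directly followed by something other than '>' or whitespace
    re_quotes = re.compile('(%s)([^>\\s])' % pat_quotes, flags)  # TODO: cache
    # Either a run of text outside tags, or '<', the tag body, and '>'
    re_pieces = re.compile('([^<]+)|(<)((?:[^\"\'>]+|%s)*)(>)' % pat_quotes, flags)  # TODO: cache
    pieces = re_pieces.findall(s)
    return s[:0].join(map(lambda m: m[0] or m[1] + re_quotes.sub('\\1 \\2', m[2]) + m[3], pieces))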

print(xml_fixup('<foo bar=""baz=""></foo>'))  # <foo bar="" baz=""></foo>

Brownie points if you spot bugs in this!

It breaks correct attributes: print(xml_fixup('')) >> — Jeroen, Sep 29 '21 at 10:08
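A likely cause of the breakage reported in the comment above: `re_quotes.sub` can begin a match at a closing quote, so in a tag such as `<a b="x" c="y">` the span `" c="` looks like a quoted value followed by `y`, and a space gets inserted inside the second value. Below is a minimal sketch of one way around that, assuming it is enough to walk the quoted values of each tag in order and look at the character that follows each one. The names `xml_fixup_tokenized` and `_space_after_value` are made up for this sketch. It is still a regex workaround rather than a parser, and it does not handle comments, CDATA sections, or other places where quotes or `<` appear outside attribute values.

```python
import re
from xml.etree import ElementTree

# A double- or single-quoted attribute value.
_QUOTED = '"[^"]*"' "|" "'[^']*'"
_QUOTED_RE = re.compile(_QUOTED)
# Either a run of text outside tags, or one whole tag whose body is a
# sequence of unquoted runs and quoted values.
_PIECES = re.compile('([^<]+)|<((?:[^"\'>]+|%s)*)>' % _QUOTED)


def _space_after_value(match):
    """Append a space when a quoted value is directly followed by more content."""
    body, end = match.string, match.end()
    if end < len(body) and not body[end].isspace() and body[end] not in '/?':
        return match.group(0) + ' '
    return match.group(0)


def xml_fixup_tokenized(s):
    out = []
    for text, tag_body in _PIECES.findall(s):
        if text:
            out.append(text)  # text outside tags is copied unchanged
        else:
            # Each quoted value is matched as a whole, so a closing quote
            # is never mistaken for the start of another value.
            out.append('<' + _QUOTED_RE.sub(_space_after_value, tag_body) + '>')
    return ''.join(out)


print(xml_fixup_tokenized('<foo bar=""baz=""></foo>'))  # <foo bar="" baz=""></foo>
print(xml_fixup_tokenized('<a b="x" c="y">t</a>'))      # <a b="x" c="y">t</a>
```

The result can then be passed to `ElementTree.fromstring` as usual.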
