Parsing broken XML with numbers as tag names

Question

I have lots of XML files whose tag names (keys) consist only of digits, e.g. <12345>Golly</12345>

When parsing with ElementTree I get the error "not well-formed (invalid token)". I assume this is because the keys are digits rather than words. I tried to turn the keys into strings by wrapping them in double quotes with a regex:

xmlstr = re.sub('<([\d]+)>','<"' + str(re.search('<([\d]+)>', xmlstr).group(1))+ '">',xmlstr)
xmlstr = re.sub('</([\d]+)>','</"' + str(re.search('</([\d]+)>', xmlstr).group(1))+ '">',xmlstr)

After this, every key is replaced with the first key found, so all keys end up the same, whereas the keys in the original file are unique within each document. I guess the files were converted from JSON to XML directly. The keys should represent ID numbers and the values are the names associated with those IDs.

I was wondering if there is a way to work with digits as keys, or a way to replace the keys one by one instead of replacing all matches with a single found string. .group(1) returns the first occurrence, which causes the problem. Please help.

This is not how to use `re.sub()`. The second argument should be a reference into the regexp, e.g. `re.sub(r"<(\d+)>", "", xmlstr)` — Pavel, Dec 26 '17 at 18:35
I can only do it that way if I know what I am trying to replace... but I don't (there are about 2500 XML files, each a single line). — Harris, Dec 26 '17 at 18:50

alecxe · Accepted Answer · 2017-12-26T19:08:15.667

I think you need to have both the numeric tag name and the content captured in different saving groups and then reference them in the replacement string:

In [1]: import re

In [2]: data = "<content><12345>Golly</12345><67890>Jelly</67890></content>"

In [3]: re.sub(r"<(\d+)>(.*?)</\d+>", r'<item id="\1">\2</item>', data)
Out[3]: '<content><item id="12345">Golly</item><item id="67890">Jelly</item></content>'

Though, it is difficult to come up with something 100% reliable without having access to the possible variations of the input XML data. For instance, I am not sure if this expression is going to handle nested numerical tags nicely.
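One way to sidestep the nesting concern is to rewrite opening and closing tags in two independent passes and only then hand the string to ElementTree. The following is a minimal sketch, not a definitive fix: the helper name `fix_numeric_tags` is hypothetical, the `item`/`id` naming is borrowed from the expression above, and it assumes that `<digits>` never appears inside text content or CDATA.

```python
import re
import xml.etree.ElementTree as ET


def fix_numeric_tags(xmlstr):
    """Rewrite <123>...</123> as <item id="123">...</item> (hypothetical helper)."""
    # Opening tags: keep the number, but move it into an attribute.
    # \1 is substituted per match, so every tag keeps its own number.
    xmlstr = re.sub(r"<(\d+)>", r'<item id="\1">', xmlstr)
    # Closing tags: the number is no longer needed.
    return re.sub(r"</\d+>", "</item>", xmlstr)


data = "<content><12345>Golly</12345><67890>Jelly</67890></content>"
root = ET.fromstring(fix_numeric_tags(data))

# Map each id to its text content.
pairs = {item.get("id"): item.text for item in root.iter("item")}
print(pairs)  # {'12345': 'Golly', '67890': 'Jelly'}
```

Because each tag is rewritten on its own, an input such as `<1><2>x</2></1>` becomes `<item id="1"><item id="2">x</item></item>`, which ElementTree accepts.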

You may also want to explore possibilities to parse the document in lxml's "recovery" mode.

Another possible tool that may help to deal with this situation is BeautifulSoup - you may try the non-traditional approach - parse the XML data with a lenient html5lib parser:

In [1]: from bs4 import BeautifulSoup

In [2]: data = "<content><12345>Golly</12345><67890>Jelly</67890></content>"

In [3]: soup = BeautifulSoup(data, "html5lib")

In [4]: print(soup.prettify())
<html>
 <head>
 </head>
 <body>
  <content>
   &lt;12345&gt;Golly
   <!--12345-->
   &lt;67890&gt;Jelly
   <!--67890-->
  </content>
 </body>
</html>

It is not the desired output, of course, but may be something you can work with and extract the keys and words.
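If only the keys and words are needed, they can also be pulled straight out of the raw string with the standard library, without any parser. This is a rough sketch under the assumption that the numeric tags are flat (not nested) and that each closing tag repeats the number of its opening tag; the JSON dump is only there to illustrate getting back to a key/value form.

```python
import json
import re

data = "<content><12345>Golly</12345><67890>Jelly</67890></content>"

# \1 requires the closing tag to carry the same number as the opening tag;
# re.S lets a value span several lines if it has to.
pairs = dict(re.findall(r"<(\d+)>(.*?)</\1>", data, flags=re.S))

print(json.dumps(pairs))  # {"12345": "Golly", "67890": "Jelly"}
```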

I can then scrape it as an HTML file, or convert it back to a JSON string... Thanks — Harris, Dec 26 '17 at 19:42

score 0 · Answer 2 · answered Dec 26 '17 at 19:12

The lxml package will make your life easier than struggling with regex.

Take a look at the documentation page.

pip install lxml

import lxml.etree

file_path = 'your/xml/file.xml'
# recover=True asks the parser to try to continue past malformed markup
parser_obj = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.parse(file_path, parser=parser_obj)

Krishna Teja S

It's not going to handle a case like `<12345>Golly</12345><67890>Jelly</67890>` as is; `67890` would be lost altogether. – alecxe Dec 26 '17 at 19:13
ahh... got you. – Krishna Teja S Dec 26 '17 at 20:03