
I'm trying to read bulk data from the US Patent and Trademark Office (USPTO). I have tried several of its XML files and get the same result each time:

import xml.etree.ElementTree as ET
import re
file = 'ipgb20210105.xml'
tree = ET.parse(file)

This raises: "ParseError: junk after document element: line 862, column 0"

I have also tried the recommendation to wrap the content in a fake root node, but that doesn't work either:

with open(file) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")

This raises: "ParseError: not well-formed (invalid token): line 2, column 2"

Any help much appreciated!

  • ipgb20210105.xml is not one big well-formed XML document. It consists of thousands of small XML documents (each with its own XML declaration) squashed together. – mzjn Apr 15 '21 at 17:36
  • Try [Python 3: Split concatenated XML files](https://stackoverflow.com/questions/50857535/python-3-split-concatenated-xml-files). – urznow Apr 16 '21 at 08:09
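
Following the comments above, here is a minimal sketch of one way to split the file on its XML declarations and parse each piece separately. It assumes every embedded document begins with `<?xml` at the start of a line, which matches the description in the first comment but has not been checked against every USPTO file. The function name `iter_documents` is made up for illustration. A fake root node probably cannot work here, because the repeated XML declarations (and any DOCTYPE lines) would end up inside the root element, where they are not allowed.

```python
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield one parsed root element per concatenated XML document.

    `lines` is any iterable of bytes lines, e.g. a file opened in "rb" mode.
    Assumption: each embedded document starts with an XML declaration
    ("<?xml ...?>") at the beginning of a line.
    """
    chunk = []
    for line in lines:
        # A new declaration means the previous document is complete.
        if line.lstrip().startswith(b"<?xml") and chunk:
            yield ET.fromstring(b"".join(chunk))
            chunk = []
        # Skip blank lines that appear before the first document.
        if chunk or line.strip():
            chunk.append(line)
    # Parse whatever is left after the last declaration.
    if chunk:
        yield ET.fromstring(b"".join(chunk))


# Usage (file name taken from the question):
#
# with open("ipgb20210105.xml", "rb") as f:
#     for root in iter_documents(f):
#         print(root.tag)
```

Reading in binary mode lets the parser honor the encoding named in each declaration. Because the generator holds only one document in memory at a time, it should cope with large bulk files, although a document that references undefined entities would still raise a `ParseError`.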
