<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.2 2006-08-23" file="US20110000001A1-20110106.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20101222" date-publ="20110106">
<us-bibliographic-data-application lang="EN" country="US">
<publication-reference>
<document-id>
<country>US</country>
<doc-number>20110000001</doc-number>
<kind>A1</kind>
<date>20110106</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>12838840</doc-number>
<date>20100719</date>
</document-id>
</application-reference>
<us-application-series-code>12</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>IL</country>
<doc-number>189088</doc-number>

I am trying to parse XML data obtained from Google using Python. The file is large, around 500 MB, and has around a hundred thousand lines, which makes it difficult to share its contents. One issue is that the file has no single parent node, so I had to create a dummy root for my work. However, I think the opening XML line is repeated multiple times throughout the file, and there are also multiple instances of a special character ("!"). The code I run throws a parse error:

"xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 414, column 2".

I think this is because the line contains a special character. Here is the content of the line:

"<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>"

At the same time, the opening line of the XML is repeated multiple times throughout the file. Here is what the line looks like:

"<?xml version="1.0" encoding="UTF-8"?>"

Is there a way to remove the repeated instances of these lines so that I can parse the file? As it is a large file, I cannot post its contents. I have, though, posted a few lines of the XML file to give an idea of the content; similar content is repeated throughout the file. Lines 1 and 2 of that excerpt are repeated multiple times in the XML file, and I am looking for a way to remove the repeated occurrences. I have also attached my code snippet here.

import xml.etree.ElementTree as ET
import csv
import re

with open("ipa110106.xml") as f:
    xml = f.read()
    tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
    root = tree.getroot()

check_elem = root.find('./!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd"')
    
elem.remove(check_elem)

file.write('b.xml')
  • Your file (which I presume is [here](https://bulkdata.uspto.gov/data2/patent/application/redbook/fulltext/2011/ipa110106.zip)) contains multiple XML documents in a single file. For instructions on how to deal with such a beast, try: http://stackoverflow.com/questions/5687056/parsing-a-file-with-multiple-xmls-in-it and http://stackoverflow.com/questions/4024739/how-to-parse-multiple-xml-documents-from-a-single-stream – Robᵩ Feb 14 '17 at 17:56
  • Thanks a lot! I will follow those instructions. – Harshit Feb 14 '17 at 18:05
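
Following the comment above, one possible way to handle a file made of several concatenated XML documents is to split it at each XML declaration and parse every piece on its own, rather than wrapping everything in a dummy root. The sketch below is only an illustration under stated assumptions: each document starts on a line beginning with `<?xml`, and each `<!DOCTYPE ...>` fits on a single line, as in the excerpt in the question. The names `iter_documents`, `parse_documents` and `SAMPLE` are hypothetical, and the second document number in the sample is made up for the demo.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield the text of each concatenated XML document, one at a time.

    Assumption: a new document starts at every line beginning with '<?xml'.
    The declaration line and single-line '<!DOCTYPE ...>' lines are dropped,
    since those are the repeated lines the question wants removed.
    """
    chunk = []
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("<?xml"):
            text = "".join(chunk)
            if text.strip():          # skip empty chunks, e.g. before the first declaration
                yield text
            chunk = []
        elif stripped.startswith("<!DOCTYPE"):
            continue                  # assumes the DOCTYPE does not span several lines
        else:
            chunk.append(line)
    text = "".join(chunk)
    if text.strip():
        yield text


def parse_documents(lines):
    """Parse each document separately and yield its root element."""
    for text in iter_documents(lines):
        yield ET.fromstring(text)


# Tiny in-memory stand-in for the real file (second doc-number is invented).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>
<us-patent-application><doc-number>20110000001</doc-number></us-patent-application>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>
<us-patent-application><doc-number>20110000002</doc-number></us-patent-application>
"""

roots = list(parse_documents(io.StringIO(SAMPLE)))
numbers = [root.findtext("doc-number") for root in roots]
print(len(roots), numbers)

# With the real file, the same generator could be fed the open file object, e.g.
#   with open("ipa110106.xml", encoding="utf-8") as f:
#       for root in parse_documents(f):
#           ...
```

Because the generator holds only one document in memory at a time, this should cope better with a 500 MB file than reading the whole file into a single string, though it has not been run against the actual USPTO data.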
