So I'm a beginner 'scraper' without much programming experience.
I'm using Python, in the Canopy environment, to scrape some downloaded XML files with the xml.dom.minidom parser. For now I'm only trying to pull the tags out of the very first us-bibliographic-data-grant element (which is why I'm using the [0]), just to see how I want to parse and store the data before running it on the entire dataset. An excerpt from the XML looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0606726-20091229.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20091214" date-publ="20091229">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0606726</doc-number>
<kind>S1</kind>
<date>20091229</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29299001</doc-number>
<date>20071217</date>
My code so far looks like this:
from xml.dom import minidom
filename = "C:/Users/SMOLENSK/Documents/Inventor Research/xml_2009/ipg091229.xml"
f = open(filename, 'r')
doc = f.read()
f.close()
xmldata = '<root>' + doc + '</root>'
data = minidom.parse(xmldata)
US_Biblio = xmldata.getElementsByTagName("us-bibliographic-data-grant")[0]
pat_num = US_Biblio.getElementsByTagName("doc-number")[0]
dates = pat_num.getElementsByTagName("date")
for date in dates:
    print(date)
I did get some memory errors after the code ran to completion, but it has only managed to finish once, and unfortunately I did not write down exactly what happened. Because of the amount of data (this file alone is 4.6 million lines), the run crashes almost every time and I cannot reproduce the errors.
Can anyone see anything wrong with the code? It parses the entire dataset before it starts storing each tag name. Is there a way to parse only part of the file? Perhaps I could just make a new XML file containing only the first record.
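To make the "first record only" idea concrete, here is a minimal sketch rather than a tested solution for the real file. It assumes that every record in the bulk file starts with its own `<?xml` declaration line, as the excerpt suggests; `first_record` and `publication_info` are hypothetical helper names, and the file is read line by line so the full 4.6 million lines never have to sit in memory at once.

```python
from xml.dom import minidom


def first_record(lines):
    """Collect lines up to, but not including, the second '<?xml' line.

    Assumption: each record begins with its own '<?xml' declaration line.
    """
    record = []
    for line in lines:
        if line.lstrip().startswith("<?xml") and record:
            break  # the next record starts here, so stop reading
        record.append(line)
    return "".join(record)


def publication_info(xml_text):
    """Return (doc-number, date) from the record's publication-reference."""
    # parseString takes XML text; minidom.parse expects a filename or file object.
    dom = minidom.parseString(xml_text)
    biblio = dom.getElementsByTagName("us-bibliographic-data-grant")[0]
    pub = biblio.getElementsByTagName("publication-reference")[0]
    number = pub.getElementsByTagName("doc-number")[0].firstChild.data
    date = pub.getElementsByTagName("date")[0].firstChild.data
    return number, date


# Hypothetical usage, with `filename` being the path to the bulk file:
# with open(filename) as f:
#     print(publication_info(first_record(f)))
```

Passing the open file object straight to `first_record` means reading stops as soon as the second declaration is seen. The text it returns could also be written out to a small new file for experimenting.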
If you're wondering, I added the <root> wrapper to get around the
ExpatError: junk after line xxx
I was getting beforehand. I know my coding skills aren't amazing, so hopefully I did not make a simple and disgusting programming error.
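For context on that ExpatError, a small self-contained sketch of what seems to trigger it: the bulk file looks like many complete XML documents written one after another, and a parser only accepts one document per input. The two records below are made-up stand-ins, not real patent data, and `parses_cleanly` is a hypothetical helper.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Two tiny stand-in records, concatenated the way the bulk file appears to be.
record_a = '<?xml version="1.0" encoding="UTF-8"?>\n<grant><n>1</n></grant>\n'
record_b = '<?xml version="1.0" encoding="UTF-8"?>\n<grant><n>2</n></grant>\n'
combined = record_a + record_b


def parses_cleanly(xml_text):
    """Return True if minidom accepts the text as a single XML document."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


print(parses_cleanly(record_a))   # one record on its own
print(parses_cleanly(combined))   # two records back to back
```

If the real file behaves like this toy input, feeding the parser one record at a time would avoid the error without needing to build one giant document in memory.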