
So I'm a beginner 'scraper' with not a whole truckload of programming experience.

I'm using Python, in the Canopy environment, to scrape some downloaded XML files, using the xml.dom parser to do so. I'm simply trying to scrape the tags from the very first us-bibliographic-data-grant (which is why I'm using the [0]), just to see how I want to parse and store the entire dataset rather than doing it all at once. An excerpt from the XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0606726-20091229.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20091214" date-publ="20091229">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0606726</doc-number>
<kind>S1</kind>
<date>20091229</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29299001</doc-number>
<date>20071217</date>

My code so far looks like this:

from xml.dom import minidom

filename = "C:/Users/SMOLENSK/Documents/Inventor Research/xml_2009/ipg091229.xml"

f = open(filename, 'r')

doc = f.read()

f.close()

xmldata = '<root>' + doc + '</root>'

data = minidom.parse(xmldata)

US_Biblio = xmldata.getElementsByTagName("us-bibliographic-data-grant")[0]

pat_num = US_Biblio.getElementsByTagName("doc-number")[0]

dates = pat_num.getElementsByTagName("date")

for date in dates:
    print(date)

Now I have gotten some Memory Error messages after the code fully ran, but it has only been able to run once and unfortunately I was unable to jot down what exactly happened. Due to the high load of data (this file alone being 4.6 million lines) the operation crashes almost every time, so I'm unable to replicate the errors.

Is there anything anyone can see wrong with the code? My code parses the entire dataset before it starts storing each tag name, but might there be a way to parse only a certain amount? Perhaps just make a new XML file with the first set.

If you're wondering, I used the `<root>` tags to bypass the issue of the

ExpatError: junk after line xxx

I was getting beforehand. I know my coding skills aren't amazing, so hopefully I did not make a simple and disgusting programming error.

HelloToEarth
  • You are duplicating the whole file to add the `<root>` tags. `minidom.parse` will take a `file` object. Try recasting using `with` and `data = minidom.parse(f)` – Mike Robins Jul 28 '17 at 02:28
  • Hey, Mike. Sorry to say that although I do understand what you mean about my 'xmldata' that I'm unsure how to "recast using 'with'". Could you help clarify with an example by chance? – HelloToEarth Jul 28 '17 at 02:47
  • ... [Using Python Iterparse For Large XML Files](https://stackoverflow.com/q/7171140/2823755) ... Maybe try lxml. Also, minidom has an [unlink](https://docs.python.org/3/library/xml.dom.minidom.html#xml.dom.minidom.Node.unlink) method that helps free up unused stuff. Every time you narrow down the search and make a new assignment (e.g. `US_Biblio = ...`), try deleting the previous variable (e.g. `del data`) – wwii Jul 28 '17 at 03:22
  • Not what you asked, but have you considered sequentially reading line by line and using regex to find the `doc-number` and `date` fields? If that is all you want. – Mike Robins Jul 28 '17 at 06:30
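The comments above suggest processing the file incrementally rather than all at once. As a rough sketch (assuming, as the repeated ExpatError suggests, that the bulk file is many XML documents concatenated back to back, each starting with its own `<?xml ...?>` declaration; `iter_grants` is a made-up helper name), one document can be parsed at a time:

```python
from xml.dom import minidom

def iter_grants(lines):
    """Yield one parsed minidom document per concatenated XML document.

    A fresh <?xml ...?> declaration marks the start of the next document,
    so we parse whatever accumulated before it and start a new buffer.
    """
    buf = []
    for line in lines:
        if line.lstrip().startswith('<?xml') and buf:
            yield minidom.parseString(''.join(buf))
            buf = []
        buf.append(line)
    if buf:
        yield minidom.parseString(''.join(buf))

# Illustrative stand-in for the real file: two tiny concatenated documents.
sample = ('<?xml version="1.0"?>\n<grant><doc-number>1</doc-number></grant>\n'
          '<?xml version="1.0"?>\n<grant><doc-number>2</doc-number></grant>\n')
grants = list(iter_grants(sample.splitlines(keepends=True)))
first_num = grants[0].getElementsByTagName('doc-number')[0].firstChild.data
```

With the real file you would pass the open file object (`iter_grants(f)`) and stop after `next(...)` to look at only the first grant, so the whole 4.6-million-line file never has to sit in memory as one DOM.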

1 Answer


Try:

with open(filename, 'r') as f:
    data = minidom.parse(f)

If you really need the `<root>` tag you may need to mess around a bit, maybe:

data = minidom.parse(itertools.chain('<root>', f, '</root>'))
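Note also that the question's snippet passes the wrapped *string* to `minidom.parse`, which expects a filename or file object; `minidom.parseString` is the variant that takes XML text. A minimal sketch (the inline string here stands in for the file's contents):

```python
from xml.dom import minidom

# Stand-in for doc = f.read() in the question
doc = '<us-bibliographic-data-grant><doc-number>D0606726</doc-number></us-bibliographic-data-grant>'

# parseString, not parse: the argument is XML text, not a path
data = minidom.parseString('<root>' + doc + '</root>')

pat_num = data.getElementsByTagName('doc-number')[0]
print(pat_num.firstChild.data)  # the node's text, rather than the node object's repr
```

Printing `node.firstChild.data` (instead of the node itself) is also what gets you the tag's text in the question's final loop.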
Mike Robins
  • When I use the `itertools.chain` outside of the `with` statement I'm given the same _ExpatError: junk after line xxx..._ and within the `with` statement I get an error _AttributeError: 'itertools.chain' object has no attribute 'read'_. I'm assuming the first is again due to non-exact XML root elements repeating in the data itself, but what causes the attribute error? – HelloToEarth Jul 28 '17 at 14:07
  • The parse must want a `file` object (that has a read method). The chain we are giving it is an iterator that returns strings but obviously not what the parse wants. Is the XML well formed? If not maybe try the `BeautifulSoup` package to parse it. – Mike Robins Jul 31 '17 at 00:06
  • Have a look at this [question](https://stackoverflow.com/questions/45395811/parsing-xml-with-beautiful-soup). It is a duplicate of your question. – Mike Robins Jul 31 '17 at 02:14
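As the comments note, `minidom.parse` wants an object with a `read()` method, which a bare `itertools.chain` iterator is not. A small hand-rolled adapter can bridge that gap (a sketch, not part of the original answer; `IterReader` is a made-up name):

```python
import itertools
from xml.dom import minidom

class IterReader:
    """Minimal file-like wrapper giving an iterator of strings the
    read() method that minidom.parse expects."""
    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buf = ''

    def read(self, size=-1):
        if size < 0:
            # Drain everything that remains
            out, self._buf = self._buf + ''.join(self._chunks), ''
            return out
        # Accumulate chunks until we can hand back `size` characters
        while len(self._buf) < size:
            try:
                self._buf += next(self._chunks)
            except StopIteration:
                break
        out, self._buf = self._buf[:size], self._buf[size:]
        return out

# The list stands in for the open file object f from the answer
body = ['<a>hi</a>']
dom = minidom.parse(IterReader(itertools.chain(['<root>'], body, ['</root>'])))
print(dom.documentElement.tagName)  # root
```

Note the single-element lists around `'<root>'` and `'</root>'`: chaining bare strings would yield them one character at a time, which still works with this wrapper but is needlessly slow.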