I'm trying to parse a local 14 MB HTML file.
My file looks like this (it's inconvenient because it's not nested in a useful way):
<html>
<head>Title</head>
<body>
<p class="SECMAIN">
<span class="ePub-B">\xc2\xa7 720 ILCS 5/10-8.1.</span>
</p>
<p class="INDENT-1”>(a) text</p>
<p class="INDENT-1”>(b) text</p>
<p class="INDENT-2”>(1) text</p>
<p class="INDENT-2”>(2) text</p>
<p class="SOURCE">(Source)</p>
<p class="SECMAIN">
<span class="ePub-B">\xc2\xa7 720 ILCS 5/10-9</span>
</p>
<p class="INDENT-1”>(a) something</p>
<p class="SOURCE">(Source)</p>
<p class="SECMAIN">
<span class="ePub-B">\xc2\xa7 720 ILCS 5/10-10.</span>
</p>
<p class="INDENT-1”>(a) more text</p>
<p class="SOURCE">(Source)</p>
</body>
</html>
Although my code works instantaneously on small samples of my HTML file (50 kB), it won't even finish one loop over the whole file. I've tried Mac and Windows machines with 4 GB and 8 GB of RAM respectively.
I gather from reading other posts that for-loops over largish XML files are slow and un-Pythonic, but I'm struggling to implement something like iterparse or a list comprehension instead.
I tried a list comprehension based on "Populating Python list using data obtained from lxml xpath command", and I'm not sure how to proceed with this interesting post either: "python xml iterating over elements takes a lot of memory".
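To show what I've understood from those posts, here is a minimal iterparse sketch using an in-memory sample of my file (I'm not sure whether `html=True` is the right way to feed it an HTML file rather than XML):

```python
import io
import lxml.etree

# Small in-memory stand-in for my 14 MB file
html_bytes = b"""<html><body>
<p class="SECMAIN"><span class="ePub-B">&#167; 720 ILCS 5/10-8.1.</span></p>
<p class="INDENT-1">(a) text</p>
<p class="SOURCE">(Source)</p>
</body></html>"""

classes = []
# tag='p' skips events for every other element; html=True uses lxml's HTML parser
for event, elem in lxml.etree.iterparse(
        io.BytesIO(html_bytes), events=('end',), tag='p', html=True):
    classes.append(elem.get('class'))
    elem.clear()  # drop the element's text/children so memory stays flat

print(classes)  # ['SECMAIN', 'INDENT-1', 'SOURCE']
```

With a real file I would pass the path instead of the `BytesIO` object, but I don't know if this scales the way the posts suggest.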
This is the part of my code that can't handle the full file:
import lxml.html
import cssselect
import pandas as pd
…
tree = lxml.html.fromstring(raw)
laws = tree.cssselect('p.SECMAIN span.ePub-B')
xpath_str = '''
//p[@class="SECMAIN"][{i}]/
following-sibling::p[contains(@class, "INDENT")]
[count(.|//p[@class="SOURCE"][{i}]/
preceding-sibling::p[contains(@class, "INDENT")])
=
count(//p[@class="SOURCE"][{i}]/
preceding-sibling::p[contains(@class, "INDENT")])
]
'''
paragraphs_dict = {}
paragraphs_dict['text'] = []
paragraphs_dict['n'] = []
# nested for loop:
for n in range(1, len(laws) + 1):
    law_paragraphs = tree.xpath(xpath_str.format(i=n))  # evaluate the XPath for law n
    for p in law_paragraphs:
        paragraphs_dict['text'].append(p.text_content())  # store paragraph text
        paragraphs_dict['n'].append(n)  # record which law it belongs to
The output should give me a dictionary with arrays of equal length so I can tell which law ('n') each paragraph ('p') corresponds to. The goal is to capture all the elements of class "INDENT" that sit between elements of class "SECMAIN" and "SOURCE", and record which SECMAIN they follow.
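For clarity, the goal could also be expressed as a single pass over the paragraphs in document order, counting SECMAIN elements as they appear. Here is a self-contained sketch of what I mean (the sample HTML is inlined; with the real file I would parse from disk instead):

```python
import lxml.html

# In-memory stand-in for my file
html = """<html><body>
<p class="SECMAIN"><span class="ePub-B">&#167; 720 ILCS 5/10-8.1.</span></p>
<p class="INDENT-1">(a) text</p>
<p class="INDENT-2">(1) text</p>
<p class="SOURCE">(Source)</p>
<p class="SECMAIN"><span class="ePub-B">&#167; 720 ILCS 5/10-9</span></p>
<p class="INDENT-1">(a) something</p>
<p class="SOURCE">(Source)</p>
</body></html>"""

root = lxml.html.fromstring(html)

paragraphs_dict = {'text': [], 'n': []}
n = 0  # number of SECMAIN paragraphs seen so far (0 = none yet)
for p in root.iter('p'):
    cls = p.get('class', '')
    if cls == 'SECMAIN':
        n += 1  # entering a new law
    elif cls.startswith('INDENT') and n:
        paragraphs_dict['text'].append(p.text_content())
        paragraphs_dict['n'].append(n)

print(paragraphs_dict['n'])  # [1, 1, 2]
```

This visits each paragraph exactly once, which is what I'm after, but I'm not sure it's the idiomatic way to do it.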
Thanks for your support.