
I'm trying to parse a local 14 MB HTML file.

My file looks like this (it's inconvenient because it's not nested in a useful way):

<html>
    <head>Title</head>
    <body>
        <p class="SECMAIN">
            <span class="ePub-B">\xc2\xa7 720 ILCS 5/10-8.1.</span>
        </p>
        <p class="INDENT-1">(a) text</p>
        <p class="INDENT-1">(b) text</p>
        <p class="INDENT-2">(1) text</p>
        <p class="INDENT-2">(2) text</p>
        <p class="SOURCE">(Source)</p>
        <p class="SECMAIN">
            <span class="ePub-B">\xc2\xa7 720 ILCS 5/10-9</span>
        </p>
        <p class="INDENT-1">(a) something</p>
        <p class="SOURCE">(Source)</p>
        <p class="SECMAIN">
            <span class="ePub-B">\xc2\xa7 720 ILCS 5/10-10.</span>
        </p>
        <p class="INDENT-1">(a) more text</p>
        <p class="SOURCE">(Source)</p>
    </body>
</html>

Although my code works instantaneously as desired on small samples of my HTML file (50 KB), it won't even begin one loop of the whole file. I've tried Mac and Windows machines with 4 GB and 8 GB of RAM, respectively.

I gather from reading other posts that for-loops over largish XML files are very slow and un-Pythonic, but I'm struggling to implement something like iterparse or a list comprehension.

I tried to use a list comprehension based on Populating Python list using data obtained from lxml xpath command, and I'm not sure how to proceed with this interesting post either: python xml iterating over elements takes a lot of memory

This is the part of my code that can't handle the full file.

import lxml.html 
import cssselect 
import pandas as pd 

…

tree = lxml.html.fromstring(raw) # raw holds the file contents as a string

laws = tree.cssselect('p.SECMAIN span.ePub-B') # one entry per law heading

xpath_str = ''' 
    //p[@class="SECMAIN"][{i}]/
        following-sibling::p[contains(@class, "INDENT")]
            [count(.|//p[@class="SOURCE"][{i}]/
                        preceding-sibling::p[contains(@class, "INDENT")])
            = 
            count(//p[@class="SOURCE"][{i}]/
                        preceding-sibling::p[contains(@class, "INDENT")])
            ]
    '''

paragraphs_dict = {} 
paragraphs_dict['text'] = [] 
paragraphs_dict['n'] = [] 

# nested for loop:
for n in range(1, len(laws)+1): 
    law_paragraphs = tree.xpath(xpath_str.format(i=n)) # run the XPath query for law n
    for p in law_paragraphs: 
        paragraphs_dict['text'].append(p.text_content()) # store paragraph
        paragraphs_dict['n'].append(n)

The output should give me a dictionary with lists of equal length, so I can tell which law ('n') each paragraph ('text') corresponds to. The goal is to capture all the elements of class "INDENT" that fall between elements of class "SECMAIN" and "SOURCE", and to record which SECMAIN they follow.
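
For the sample file above, the expected output would be:

{'text': ['(a) text', '(b) text', '(1) text', '(2) text', '(a) something', '(a) more text'],
 'n': [1, 1, 1, 1, 2, 3]}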

Thanks for your support.

Sam
  • It's immediately obvious that you've got two nested loops here so your solution will be quadratic; but I don't understand either the problem or your chosen technology well enough to suggest a solution. Personally, I wouldn't dream of coding this in a low-level procedural language; I would use a high-level declarative XML-oriented language such as XSLT or XQuery, which will have lots of optimizations built in. – Michael Kay Mar 05 '20 at 09:06
  • What do you mean by quadratic? – Sam Mar 05 '20 at 16:37
  • Quadratic means that the execution time is proportional to the square of the input data size. – Michael Kay Mar 05 '20 at 21:45
  • Instead of describing your data, and expected output, please show a real (but minimal) example of your input document, _actual_ code (no placeholders for XPath expressions) and expected output. Thanks. – Mathias Müller Mar 06 '20 at 10:40
  • Thank you for your comments, @MichaelKay. Interesting and useful. – Sam Mar 06 '20 at 21:47
  • @MathiasMüller I updated the question to include sample file and code. Let me know if I should add or change anything to make it more clear. – Sam Mar 06 '20 at 21:49
  • In my opinion, your description of the expected output is still hard to understand. Why not show the expected output for the HTML you are showing? – Mathias Müller Mar 07 '20 at 10:51

1 Answer


Consider your XPath expression: for every SECMAIN number, you iterate over the SECMAINs up to that number, then you iterate twice over the SOURCEs to find the matching one, and then you check all the preceding INDENTs and take the nodes that are among them. Even if there is some optimization, the finite state automaton will have a lot of work to do! It may be worse than quadratic (see the comments).
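
Even staying with lxml, a single linear pass over the p elements avoids all of that re-scanning. A minimal sketch, assuming the parsed tree fits in memory and the class values are exactly as in your sample:

# one pass over the already-parsed tree: O(number of <p> elements)
paragraphs_dict = {'text': [], 'n': []}
n = 0
for p in tree.iter('p'):
    cls = p.get('class', '')
    if cls == 'SECMAIN':
        n += 1  # entering the next law
    elif cls.startswith('INDENT'):
        paragraphs_dict['text'].append(p.text_content())
        paragraphs_dict['n'].append(n)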

For a 14 MB file, though, I would use a more direct approach with a SAX parser, which streams the document instead of building a tree.

import xml.sax
import io

class MyContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.n = 0
        self.d = {'text': [], 'n': []}
        self.in_indent = False

    def startElement(self, name, attributes):
        cls = attributes.get("class", "") # a <p> may have no class attribute
        if name == "p" and cls == "SECMAIN":
            self.n += 1 # next SECMAIN
        if name == "p" and cls.startswith("INDENT"):
            self.in_indent = True # mark that we are in an INDENT par
            self.cur = [] # to store chunks of text

    def endElement(self, name):
        if name == "p" and self.in_indent:
            self.in_indent = False # mark that we leave an INDENT par
            self.d['text'].append("".join(self.cur)) # append the INDENT text
            self.d['n'].append(self.n) # and the number

    def characters(self, data):
        # https://docs.python.org/3/library/xml.sax.handler.html#xml.sax.handler.ContentHandler.characters
        # "SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks"
        if self.in_indent: # only if an INDENT par:
            self.cur.append(data) # store the chunks

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
handler = MyContentHandler()
parser.setContentHandler(handler)
parser.parse(io.StringIO(raw))

print(handler.d)
# {'text': ['(a) text', '(b) text', '(1) text', '(2) text', '(a) something', '(a) more text'], 'n': [1, 1, 1, 1, 2, 3]}

This should be a lot faster than the XPath version.
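
Since the question mentions iterparse: the same one-pass logic also works with lxml.etree.iterparse, which streams the file instead of holding the whole tree. A sketch, assuming the data lives in a file named statutes.html (hypothetical name):

from lxml import etree

d = {'text': [], 'n': []}
n = 0
# html=True uses libxml2's streaming HTML parser; tag='p' yields only <p> end events
for event, elem in etree.iterparse('statutes.html', events=('end',), tag='p', html=True):
    cls = elem.get('class', '')
    if cls == 'SECMAIN':
        n += 1
    elif cls.startswith('INDENT'):
        d['text'].append(''.join(elem.itertext()))  # plain _Element has no text_content()
        d['n'].append(n)
    elem.clear()  # free the element once processed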

jferard
  • This worked and is incredibly fast--it parsed the whole file in a few seconds. Thank you very much. I'm trying to understand what's happening in the code. – Sam Mar 07 '20 at 16:16
  • Out of curiosity, why is io.StringIO() necessary here? I read https://stackoverflow.com/questions/50418448/io-stringio-vs-open-in-python-3/50418544#50418544 and I'm not sure what kind of object parser.parse() is looking for. It seems strange that parser.parse() doesn't like my file whether I open() it as a string or as bytes. – Sam Mar 11 '20 at 15:21
  • @Sam You don't need specifically a `io.StringIO`, you need a file like object or a filename/path. I used `io.StringIO` to create this file like object from a plain string. From [the doc](https://docs.python.org/3/library/xml.sax.html): `XMLReader.parse(source)`: *[...] The source object can be a system identifier (a string identifying the input source – typically a file name or a URL), a pathlib.Path or path-like object, or an InputSource object.* – jferard Mar 11 '20 at 20:28
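
For example, any of these sources work with the handler above (the file name is hypothetical):

parser.parse('statutes.html')           # a system identifier (file name)
with open('statutes.html', 'rb') as f:  # or an open binary file object
    parser.parse(f)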