
I would like to parse large HTML files and extract information from them through XPath. To do that, I'm using Python and lxml. However, lxml does not seem to work well with large files: it only parses files correctly when they are smaller than around 16 MB. The fragment of code where it tries to extract information from the HTML code through XPath is the following:

import lxml.html

tree = lxml.html.fragment_fromstring(htmlCode)
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")

The variable htmlCode contains the HTML code read from a file. I also tried using the parse method to read the code directly from the file instead of passing it in as a string, but it didn't work either. Since the contents of the file are read successfully, I think the problem is related to lxml. I've been looking for other libraries to parse HTML and use XPath, but it looks like lxml is the main library used for that.
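
For reference, the parse-based attempt looked roughly like this (the file name is just a placeholder):

import lxml.html

tree = lxml.html.parse("large_file.html")
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")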

Is there another method/function of lxml that deals better with large HTML files?

user12707
  • Could you be more specific about the problems with parsing larger files? 16 MB is not so much (it is not small either, but it is definitely not something huge). What errors do you get? – Jan Vlcinsky Jun 10 '14 at 16:18
  • Actually I don't get an error: no exception is thrown by Python, but nothing is returned by tree.xpath for files larger than 16 MB. I mean, it seems that I am only able to extract information from files smaller than 16 MB. – user12707 Jun 10 '14 at 18:40
  • Are [`etree.iterparse()` and `etree.iterwalk()`](http://lxml.de/parsing.html#iterparse-and-iterwalk) of any help? See this [question](https://stackoverflow.com/questions/7171140/using-python-iterparse-for-large-xml-files). – rypel May 15 '15 at 17:52

1 Answer


If the file is very large, you can use iterparse and pass the html=True argument to parse the file as HTML without any validation. Because iterparse gives you one element at a time rather than a whole tree to query, you need to recreate the conditions of your XPath expression manually.

from lxml import etree
import sys
import unicodedata

TAG = '{http://www.mediawiki.org/xml/export-0.8/}text'

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    # modified to call func() only for the target event and element
    for event, elem in context:
        if event == 'end' and elem.tag == TAG:
            func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem, fout):
    global counter
    normalized = unicodedata.normalize('NFKD', \
            unicode(elem.text)).encode('ASCII','ignore').lower()
    print >>fout, normalized.replace('\n', ' ')
    if counter % 10000 == 0: print "Doc " + str(counter)
    counter += 1

def main():
    fin = open("large_file", 'r')
    fout = open('output.txt', 'w')
    context = etree.iterparse(fin,html=True)
    global counter
    counter = 0
    fast_iter(context, process_element, fout)

if __name__ == "__main__":
    main()

Source
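
To tie this back to the question: with iterparse you match elements as they stream by and run only small, relative XPath queries on each matched subtree, instead of one big XPath over the whole document. Below is a rough, untested sketch of that idea for the original selector (the file name is a placeholder):

from lxml import etree

def extract_item_texts(path):
    # Streaming equivalent of //*[contains(@id, 'item')]/div/div[2]/p/text()
    results = []
    for _, elem in etree.iterparse(path, html=True):
        if 'item' in (elem.get('id') or ''):
            # At the 'end' event the children of elem are fully parsed,
            # so a relative XPath on this small subtree is cheap.
            results.extend(elem.xpath('./div/div[2]/p/text()'))
            # Free the memory used by the subtree we are done with.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
    return results

print(extract_item_texts('large_file.html'))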

Axel Advento
mudit