Processing large file with python regex without overwhelming memory

Question

I have a well-structured 60 GB, 1-billion-line XML file from which I'd like to extract specific tags (<title>info</title>), with each tag occurring across multiple lines. Instead of parsing the whole file with lxml, I'd like to use a multiline regular expression. I would use re.findall(), but it builds the full list of matches in memory, which risks excessive memory usage. Instead, I want to write each match to a file as it's found, discard it, and move on.

What is a good way of doing this?

Use `readlines()`, read a bunch of lines at a time, and keep writing the data you want to a new file. — Jay, Nov 14 '17 at 20:33
Multiline? You could use a revolving buffer of lines, or read chunk by chunk, but it's difficult to get a whole block of data that way. — Jean-François Fabre, Nov 14 '17 at 20:35
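A minimal sketch of the chunk-by-chunk idea, assuming the tags are plain `<title>` with no attributes and never nested. The names `iter_titles` and `extract_titles`, the pattern, and the 1 MiB chunk size are illustrative choices, not something taken from the question. The unmatched tail of each chunk is carried into the next read so that a block split across a chunk boundary is still found:

```python
import re

# Assumed pattern: a <title> element with no attributes and no nesting.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
OPEN_TAG = "<title>"


def iter_titles(stream, chunk_size=1 << 20):
    """Yield the text of each <title> element from a text stream, chunk by chunk."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        last_end = 0
        for match in TITLE_RE.finditer(buffer):
            yield match.group(1)
            last_end = match.end()
        # Discard everything up to the end of the last complete match.
        buffer = buffer[last_end:]
        start = buffer.find(OPEN_TAG)
        if start != -1:
            # An element is still open: keep it until its closing tag arrives.
            buffer = buffer[start:]
        else:
            # Keep only a short tail in case "<title>" is split across chunks.
            buffer = buffer[-(len(OPEN_TAG) - 1):]


def extract_titles(src_path, dst_path):
    """Write each match to dst_path as soon as it is found (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for title in iter_titles(src):
            dst.write(title + "\n")
```

Memory use stays around one chunk plus the longest open element, provided every `<title>` is eventually closed. A missing closing tag would make the buffer grow until the end of the file.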
Read until you find a `<title>` tag, then save while reading further until you find the *end* tag, then parse that block with lxml? — wwii, Nov 14 '17 at 20:38
@wwii How would I go about reading until I find a tag? Are you suggesting I read line by line and search within each line? — zadrozny, Nov 14 '17 at 20:41
@zadrozny as much as I'd love to agree with you, I've seen people abuse regex to try and scrape HTML or XML. The answer you linked to even says *limited*, *known* set of HTML. It may work for your subset, but don't rely on it. [HTML is not a regular language and hence cannot be parsed by regular expressions](https://stackoverflow.com/a/1732454/3600709) — ctwheels, Nov 14 '17 at 20:43
`... read line by line and search within each line?` It's a start. How long could it take? — wwii, Nov 14 '17 at 20:46
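A rough sketch of the line-by-line scan discussed above, assuming the opening and closing tags are literally `<title>` and `</title>` and that neither tag is itself split across a line break. The function name `iter_title_blocks` is made up for illustration. Only the lines of the block currently being collected are held in memory:

```python
OPEN_TAG = "<title>"
CLOSE_TAG = "</title>"


def iter_title_blocks(lines):
    """Yield each <title>...</title> block, which may span several lines."""
    block = None  # None while we are outside a <title> element
    for line in lines:
        pos = 0
        while True:
            if block is None:
                start = line.find(OPEN_TAG, pos)
                if start == -1:
                    break  # nothing of interest left on this line
                block = []
                pos = start
            end = line.find(CLOSE_TAG, pos)
            if end == -1:
                # The block continues on the next line.
                block.append(line[pos:])
                break
            end += len(CLOSE_TAG)
            block.append(line[pos:end])
            yield "".join(block)
            block = None
            pos = end  # keep scanning the rest of the line
```

Because a file object iterates lazily over its lines, something like `for block in iter_title_blocks(open(path, encoding="utf-8"))` would walk the file in a single pass, and each yielded block could then be written out or handed to a parser.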
It's common to use a SAX parser or something like `ElementTree.iterparse` for this type of thing. They read the input stream, parse the XML, and emit events for tags, etc. Why bother with a regex when you've got other tools for the job? — tdelaney, Nov 14 '17 at 20:46
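For comparison, a minimal sketch of the `ElementTree.iterparse` approach from the standard library. It assumes the elements are named `title` with no XML namespace (a namespaced file would report tags as `{namespace}title`), and the function name `iter_title_text` is hypothetical. Clearing the root after each hit is a common idiom for keeping the partially built tree from growing:

```python
import xml.etree.ElementTree as ET


def iter_title_text(source):
    """Yield the text of every <title> element from a filename or binary file object."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "title":
            yield elem.text or ""
            # Drop everything parsed so far so memory use stays bounded.
            root.clear()
```

Usage would look like `for text in iter_title_text("dump.xml"): out.write(text + "\n")`, where the filename is a placeholder. Memory then depends on how much XML sits between consecutive `title` elements, not on the size of the file.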
@ctwheels Fair. But the xml in question is likely well structured, it seems wasteful to parse the whole tree for something basic, and I worry my host will kill my process if the memory usage runs high. — zadrozny, Nov 14 '17 at 20:48
