I have a well-structured 60 GB / 1-billion-line XML file from which I'd like to extract specific tags (<title>info</title>), with each tag occurring across multiple lines. Instead of parsing the whole file with lxml, I'd like to use a multiline regular expression. I would use re.findall(), but it builds the full list of matches in memory and needs the whole input as a single string, so memory usage could get excessive. Instead, I want to write each match to a file as it's found, discard it, and move on.

What is a good way of doing this?

zadrozny
  • Use `readlines()`, read a bunch of lines at a time, and keep writing the data you want to a new file. – Jay Nov 14 '17 at 20:33
  • Multiline? You could use a revolving buffer of lines, or read chunk by chunk, but it's difficult to get the proper data in one whole block. – Jean-François Fabre Nov 14 '17 at 20:35
  • Read until you find a `<title>` tag, then save while reading further until you find the *end* tag, and parse that block with lxml? – wwii Nov 14 '17 at 20:38
  • @ctwheels https://stackoverflow.com/a/1733489/1366410 – zadrozny Nov 14 '17 at 20:39
  • @wwii How would I go about reading until I find a tag? Are you suggesting I read line by line and search within each line? – zadrozny Nov 14 '17 at 20:41
  • @zadrozny as much as I'd love to agree with you, I've seen people abuse regex to try and scrape HTML or XML. The answer you linked to even says *limited*, *known* set of HTML. It may work for your subset, but don't rely on it. [HTML is not a regular language and hence cannot be parsed by regular expressions](https://stackoverflow.com/a/1732454/3600709) – ctwheels Nov 14 '17 at 20:43
  • `... read line by line and search within each line?` It's a start. How long could it take? – wwii Nov 14 '17 at 20:46
  • It's common to use a SAX parser or something like `ElementTree.iterparse` for this type of thing. They read the input stream, parse the XML, and emit events for tags, etc. Why bother with a regex when you've got other tools for the job? – tdelaney Nov 14 '17 at 20:46
  • @ctwheels Fair. But the XML in question is likely well structured, and it seems wasteful to parse the whole tree for something basic. I also worry my host will kill my process if the memory usage runs high. – zadrozny Nov 14 '17 at 20:48
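
For reference, here is a minimal sketch of two of the approaches suggested in the comments above, assuming Python 3 and only the standard library. It is not a definitive solution. The function names `extract_with_iterparse` and `extract_by_line_scan`, the default tag `title`, and the choice to collapse whitespace so each match lands on one output line are all illustrative assumptions, not something from the question. The first function streams the file with `xml.etree.ElementTree.iterparse` and clears the root as it goes, so the full tree is never held in memory. The second reads line by line, buffers from an opening tag to its closing tag, and runs a regex only on that small block; it assumes non-nested `<title>` elements, no namespaces, and no CDATA or comments containing the tag, and it does not decode entities such as `&amp;`.

```python
import re
import xml.etree.ElementTree as ET


def extract_with_iterparse(xml_path, out_path, tag="title"):
    """Stream-parse xml_path; write the text of each <tag> element to out_path."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        context = ET.iterparse(xml_path, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        for event, elem in context:
            if event != "end":
                continue
            if elem.tag == tag:  # with namespaces this would be "{uri}title"
                text = "".join(elem.itertext())
                # Collapse whitespace so a multi-line tag becomes one output line.
                out.write(" ".join(text.split()) + "\n")
                count += 1
            # Detach finished subtrees from the root so memory stays bounded.
            root.clear()
    return count


def extract_by_line_scan(xml_path, out_path, tag="title"):
    """Regex variant: buffer lines from an opening tag to its closing tag."""
    name = re.escape(tag)
    open_re = re.compile(r"<%s[\s>]" % name)
    block_re = re.compile(r"<%s(?:\s[^>]*)?>(.*?)</%s>" % (name, name), re.DOTALL)
    close_tag = "</%s>" % tag
    count = 0
    buffer = []
    with open(xml_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            # Skip lines until an opening tag shows up.
            if not buffer and not open_re.search(line):
                continue
            buffer.append(line)
            if close_tag not in line:
                continue
            block = "".join(buffer)
            last_end = 0
            for match in block_re.finditer(block):
                # Entities are written as-is here (no XML decoding).
                out.write(" ".join(match.group(1).split()) + "\n")
                count += 1
                last_end = match.end()
            # Keep any tag that opened after the last complete match.
            rest = block[last_end:]
            buffer = [rest] if open_re.search(rest) else []
    return count
```

Either function could be called as, say, `extract_with_iterparse("dump.xml", "titles.txt")`, where both file names are placeholders. In each case only one element or one small block of lines is held at a time, and every match is written out as soon as it is found.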

0 Answers