I have a well-structured 60 GB / 1 billion line XML file that I'd like to extract specific tags from (<title>info</title>), with each tag occurring across multiple lines. Instead of parsing the whole file with lxml, I'd like to use a multiline regular expression. I would use re.findall(), but that seems to risk a memory leak or excessive memory usage. Instead, I want to write each match to a file as it's found, discard it, and move on.
What is a good way of doing this?
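
To make the goal concrete, here is a rough sketch of the kind of thing being asked for, not a definitive solution. It reads the file in fixed-size chunks, runs re.finditer() over a small buffer, and hands back each match as soon as it is found. The names iter_titles, extract_titles, and chunk_size are made up for illustration. It assumes the tag is written exactly as <title> with no attributes, that the file uses an ASCII-compatible encoding such as UTF-8, and that no <title> appears inside comments or CDATA sections.

```python
import re

# DOTALL lets "." cross newlines, since each tag may span several lines.
# The lazy ".*?" stops at the first closing tag.
TITLE_RE = re.compile(rb"<title>(.*?)</title>", re.DOTALL)
OPEN_TAG = b"<title>"


def iter_titles(stream, chunk_size=1 << 20):
    """Yield the body of every <title> element found in a binary stream.

    Only the unfinished part of the input is kept between reads, so memory
    use stays near chunk_size unless a single element is very long.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk

        end = 0
        for match in TITLE_RE.finditer(buf):
            yield match.group(1)
            end = match.end()

        # Drop everything already matched. Keep either an element that has
        # opened but not yet closed, or just enough trailing bytes to catch
        # an opening tag that was split across two chunks.
        tail = buf[end:]
        start = tail.find(OPEN_TAG)
        if start != -1:
            buf = tail[start:]
        else:
            buf = tail[-(len(OPEN_TAG) - 1):]


def extract_titles(src_path, dst_path):
    """Write each match to dst_path as it is found (hypothetical helper)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for title in iter_titles(src):
            dst.write(title + b"\n")
```

A match that itself contains newlines is written unchanged, so the output file is not strictly one title per line. If a <title> is never closed, the buffer keeps growing until the end of the file, so this only suits input that really is well-formed.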