I'm trying to read an XML file inside one of Stack Overflow's data dumps, but it's not as easy as I expected. The idea is to read each tag and extract something from it. I haven't found a way to read the file on the fly. Here is my code (archive_path is a 7z file containing filename):
    import py7zlib
    from xml.etree.cElementTree import iterparse

    def test(archive_path, filename):
        with open(archive_path, 'rb') as fp:
            archive = py7zlib.Archive7z(fp)
            context = iterparse(archive.getmember(filename), events=("start", "end"))
            context = iter(context)
            event, root = next(context)
            for event, elem in context:
                if event == "end" and elem.tag == "row":
                    import code; code.interact(local=locals())
                    root.clear()
(Note the import code line, which I use for debugging.) Here is a link to a small, randomly chosen Stack Exchange dump: https://archive.org/download/stackexchange/webmasters.meta.stackexchange.com.7z
The problem I face right now is that iterparse doesn't seem to work (Python 3), and the file doesn't seem to be read correctly (can it be read on the fly?).
Edit: The code as it is now raises an exception on next(context):
      File "stack.py", line 25, in readRows
        event, root = next(context) #context.__next__()
      File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1223, in iterator
        data = source.read(16 * 1024)
    TypeError: read() takes 1 positional argument but 2 were given
The error is probably caused by py7zlib's file objects, which only have a read() that takes no size argument and reads the whole file at once.
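To illustrate what I think is going on, here is a minimal sketch that does not use py7zlib at all. FullReadOnly is a hypothetical stand-in class I made up to mimic a member whose read() takes no size argument, and count_rows is just a helper name for this example. Passing the stand-in straight to iterparse should fail with the same kind of TypeError, while wrapping the bytes in io.BytesIO gives iterparse the read(size) it expects. Note that this workaround loads the whole file into memory, so it is not really reading on the fly.

```python
import io
from xml.etree.ElementTree import iterparse


class FullReadOnly:
    """Hypothetical stand-in for an archive member whose read() takes no size."""

    def __init__(self, data):
        self._data = data

    def read(self):
        # Returns the whole content at once, like the behaviour I suspect.
        return self._data


def count_rows(source):
    """Count <row> elements, clearing each one after it has been handled."""
    count = 0
    for event, elem in iterparse(source, events=("end",)):
        if elem.tag == "row":
            count += 1
            elem.clear()
    return count


xml_bytes = b'<posts><row Id="1"/><row Id="2"/></posts>'
member = FullReadOnly(xml_bytes)

try:
    # iterparse calls source.read(16 * 1024), which this object cannot handle.
    count_rows(member)
except TypeError as exc:
    print("direct parse failed:", exc)

# Workaround sketch: read everything, then expose it through a real file object.
print(count_rows(io.BytesIO(member.read())))
```

If the real py7zlib member behaves like the stand-in, the same io.BytesIO wrapping should get past the exception, at the cost of holding the decompressed XML in memory.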