Read huge xml inside a 7z archive

Question

I'm trying to read an XML file inside one of Stack Overflow's data dumps, but it turned out to be harder than I expected. The idea is to read each tag and extract something from it. I couldn't find a way to read the file on the fly. Here is my code (`archive_path` is a 7z file containing `filename`):

import py7zlib
from xml.etree.ElementTree import iterparse  # cElementTree was removed in Python 3.9

def test(archive_path,filename):
    with open(archive_path,'rb') as fp:
        archive = py7zlib.Archive7z(fp)
        context = iterparse(archive.getmember(filename), events=("start", "end"))
        context = iter(context)
        event, root = next(context)
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                import code; code.interact(local=locals())
                root.clear()

(Note the `import code` line, which I use for debugging.) Here is a link to a small Stack Exchange dump, picked at random: https://archive.org/download/stackexchange/webmasters.meta.stackexchange.com.7z

The problem I face right now is that `iterparse` does not seem to work (Python 3) and the file does not seem to be read correctly. Can it be read on the fly?

Edit: The code as it is now raises an exception on `next(context)`:

 File "stack.py", line 25, in readRows
   event, root = next(context) #context.__next__()
 File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1223, in iterator
   data = source.read(16 * 1024)
TypeError: read() takes 1 positional argument but 2 were given

The error is probably caused by py7zlib's file objects, which only provide a `read()` that takes no size argument and reads the full file.
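The mismatch can be reproduced without any archive. This is a minimal sketch, assuming a hypothetical stand-in class `FullReadOnly` whose `read()` accepts no size argument, which is how the traceback suggests the py7zlib member object behaves. The class, the `first_event` helper, and the sample XML are made up for illustration.

```python
import io
from xml.etree.ElementTree import iterparse


class FullReadOnly:
    """Hypothetical stand-in for an object whose read() accepts no size."""

    def __init__(self, data):
        self._data = data

    def read(self):
        # Returns everything at once; there is no way to ask for a chunk.
        return self._data


XML = b"<rows><row Id='1'/><row Id='2'/></rows>"


def first_event(source):
    """Return the first (event, element) pair, or the TypeError that was raised."""
    try:
        return next(iter(iterparse(source, events=("start", "end"))))
    except TypeError as exc:
        return exc


# iterparse calls source.read(16 * 1024), which FullReadOnly cannot accept.
failure = first_event(FullReadOnly(XML))
# io.BytesIO accepts a size argument, so parsing starts normally.
success = first_event(io.BytesIO(XML))
```

If that assumption holds, `iterparse` itself is fine and only needs a source whose `read(size)` returns the data in chunks.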

What is that line for: `data = archive.getmember(filename).read()`. It reads the whole file into memory, which is wasteful, and you're not even using `data` anywhere. — Tomalak, Mar 04 '19 at 17:08
Also, I presume you're doing `data[3:]` to skip the UTF-8 BOM? There is no reason to do that, either. The XML parser knows how to handle the BOM. — Tomalak, Mar 04 '19 at 17:12
Apologies. Those are lines from my previous attempt. I'll write a comment — maugch, Mar 05 '19 at 08:10
I'd recommend removing them from the code sample entirely if you're not using them anymore. Also, please specify what "seems not to work" means. What's not happening, what happens instead, and what are your expectations? — Tomalak, Mar 05 '19 at 10:38
In the meantime, I have found [this thread](https://stackoverflow.com/questions/20104460/how-to-read-from-a-text-file-compressed-with-7z). It's old, but I think the core point still applies: *"`py7zlib` doesn't provide an API that would allow archive members to be read as a stream of bytes or characters"*. That would mean `py7zlib` is responsible for the impression that `iterparse` doesn't parse iteratively. If you unpack the 7z file and try on the raw XML it will likely work as expected. — Tomalak, Mar 05 '19 at 10:54
I just found a feature request about it. https://github.com/fancycode/pylzma/issues/58 thanks anyway @Tomalak — maugch, Mar 05 '19 at 11:04
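One possible workaround, following the point above that `iterparse` works on a stream it can read in chunks, is to let an external tool decompress the member and pipe it to the parser. This is only a sketch under assumptions: it requires the `7z` command-line tool to be installed and on `PATH` (its `-so` switch writes extracted data to standard output), and the function names `iter_rows` and `iter_rows_from_7z` are made up for illustration. It does not use py7zlib at all.

```python
import io
import subprocess
from xml.etree.ElementTree import iterparse


def iter_rows(stream):
    """Yield the attributes of each <row> element from a chunk-readable stream."""
    context = iter(iterparse(stream, events=("start", "end")))
    _, root = next(context)          # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()             # drop processed rows so memory use stays flat


def iter_rows_from_7z(archive_path, filename):
    """Stream rows of `filename` inside `archive_path` (assumes `7z` is on PATH)."""
    # `e -so` extracts the named member to standard output instead of to disk.
    cmd = ["7z", "e", "-so", archive_path, filename]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        yield from iter_rows(proc.stdout)


# The parsing half can be tried on any in-memory stream, no archive needed.
sample = io.BytesIO(b"<posts><row Id='1' Title='a'/><row Id='2'/></posts>")
rows = list(iter_rows(sample))
```

With the dump linked in the question, usage would look like `for row in iter_rows_from_7z("webmasters.meta.stackexchange.com.7z", "Posts.xml"): ...`, where `Posts.xml` is an assumed member name.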
