
I have a multi-GB 7z archive that contains a single XML file. I want to read this compressed file one line at a time until its EOF is reached, using Python 3.4. I cannot afford to decompress it to its full size, which is around a couple of terabytes.

Several libraries were suggested to me, such as pylzma and lzma, but as far as I can tell they don't support the 7z format. libarchive does support 7z, but I think it reads in blocks, which do not necessarily correspond to lines of text in the file.

Please provide suggestions. Thanks.

AayDee
  • If libarchive reads in blocks, you could keep reading until you find a `'\n'` and yield everything up to it, creating your own line generator. But are you sure your uncompressed file contains newline characters? – DainDwarf Dec 07 '15 at 14:14
  • Yes, I am sure that it contains newline characters. Could you elaborate on using `yield` until the newline? I have code for the `yield` part, as you suggest, from here: http://stackoverflow.com/questions/20104460/how-to-read-from-a-text-file-compressed-with-7z-in-python – AayDee Dec 07 '15 at 14:55

1 Answer


(Elaborating on the `yield` part.) Note that I do not know this library, or which function you would use to get blocks of uncompressed data. But I mean something like this:

import py7zlib  # ships with the pylzma package

def read_7z_lines(filename):
    with open(filename, 'rb') as fh:  # the file handle is closed automatically when finished
        archive = py7zlib.Archive7z(fh)
        current_line = b''  # blocks are bytes in Python 3, so buffer as bytes
        # NOTE: getblock() is a placeholder, not a real py7zlib method. I do not know
        # how you get a block of uncompressed data, so I "abstract" the call.
        for block in archive.getblock():
            current_line += block
            while b'\n' in current_line:
                end = current_line.index(b'\n') + 1
                yield current_line[:end]  # give everything up to and including '\n' to the caller
                current_line = current_line[end:]  # keep the rest of the block for the next line
        if current_line:
            yield current_line  # last line, if the file does not end with '\n'

Then you can use it like this:

for line in read_7z_lines('myfile.7z'):
    print(line)

If someone who knows the library can fill in the correct call for reading blocks, edits are welcome.
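
Since the block-reading call is the unknown part, the line-splitting logic can be separated from it and tried on its own. Below is a minimal sketch, assuming only that whatever library you use can hand you an iterable of `bytes` blocks; `iter_lines` and `fake_blocks` are names I made up for illustration, not part of any library.

```python
def iter_lines(blocks, newline=b'\n'):
    """Yield complete lines (as bytes) from an iterable of byte blocks."""
    pending = b''  # bytes seen so far that do not yet end in a newline
    for block in blocks:
        pending += block
        # Every piece except the last is a complete line; the last piece
        # is the unfinished remainder (empty if the block ended on a newline).
        *complete, pending = pending.split(newline)
        for line in complete:
            yield line + newline
    if pending:
        yield pending  # final line without a trailing newline


# Stand-in for a real decompressor: blocks cut at arbitrary byte offsets.
fake_blocks = [b'<a>\n<b', b'>\n', b'<c>', b'\n<d>']
lines = list(iter_lines(fake_blocks))
for line in lines:
    print(line)
```

Only the current unfinished line and one block are held in memory at a time. If you need text rather than bytes, decode each yielded line (for example with `line.decode('utf-8')`, assuming the XML is UTF-8) rather than each block, since a block boundary can fall in the middle of a multi-byte character but a `b'\n'` split cannot.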

wildwilhelm
DainDwarf