read a .7z file in memory with Python, and process each line as a stream

Question

I'm working with a huge .7z file that I need to process line by line.

First I tried py7zr, but it only works by first decompressing the whole file into an object. This runs out of memory.

Then I tried libarchive, which can read the archive block by block, but there's no straightforward way to split these binary blocks into lines.

What can I do?

Related questions I researched first:

How to read contents of 7z file using python: The answers only decompress the whole file.
How to read from a text file compressed with 7z?: Seeks Python 2.7 answers.
Python: How can I read a line from a compressed 7z file in Python?: Focuses on a single line, no accepted answer - only answer posted 7 years ago.

I'm looking for ways to improve the temporary solution I built myself - posted as an answer here. Thanks!

score 2 · Accepted Answer · answered Mar 18 '23 at 04:18

This solution iterates over every block returned by get_blocks(). If a block doesn't end in \n, the trailing partial line is kept and prepended to the next block before lines are yielded.

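import libarchive  # python-libarchive-c

def process(my_file):
    with libarchive.file_reader(my_file) as archive:
        for entry in archive:
            data = ''  # unfinished line carried over between blocks
            for block in entry.get_blocks():
                # ISO-8859-1 maps each byte to exactly one character, so a
                # block boundary can never split a character in half.
                data += block.decode('ISO-8859-1')
                # Split on '\n' only: splitlines() also breaks on characters
                # such as '\r', '\x1c' or '\x85' and would drop them.
                lines = data.split('\n')
                # The last piece is '' if data ended in '\n', otherwise it
                # is a partial line that must wait for the next block.
                data = lines.pop()
                for line in lines:
                    yield ({'l': line},)
            # Emit the final line if the entry does not end in '\n'.
            if data:
                yield ({'l': data},)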
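
If the file is not ISO-8859-1 (for example UTF-8), a multi-byte character can be cut in half at a block boundary, and decoding each block on its own would fail or corrupt it. A possible variation, offered as a sketch rather than a tested drop-in, is to push the blocks through an incremental decoder from the standard codecs module. The names iter_lines and lines_from_archive are hypothetical helpers introduced here for illustration, and the UTF-8 default and errors="replace" are assumptions, not something the original answer uses.

```python
import codecs


def iter_lines(blocks, encoding="utf-8"):
    """Yield text lines (without the trailing '\n') from an iterable of bytes blocks.

    Hypothetical helper: works on any iterable of bytes objects, so it can be
    tried out without an archive at hand.
    """
    # The incremental decoder buffers an incomplete multi-byte sequence at the
    # end of one block and finishes it when the next block arrives.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    pending = ""  # partial line carried over between blocks
    for block in blocks:
        pending += decoder.decode(block)
        # Everything before the last '\n' is complete; the rest stays pending.
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line
    # Flush whatever the decoder still holds, then the last unterminated line.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending


def lines_from_archive(path, encoding="utf-8"):
    """Hypothetical wrapper: stream lines from every entry of an archive."""
    import libarchive  # python-libarchive-c, imported lazily

    with libarchive.file_reader(path) as archive:
        for entry in archive:
            yield from iter_lines(entry.get_blocks(), encoding)
```

Usage would look like `for line in lines_from_archive('big.7z'): ...`, where 'big.7z' is a placeholder path. Lines ending in \r\n keep their \r with this approach, so call line.rstrip('\r') if the file uses Windows line endings.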
