
I have a huge compressed file from which I want to read the individual dataframes one by one, so as not to run out of memory.

Also, due to time and space constraints, I can't extract the .tar.gz.

This is the code I've got so far:

import pandas as pd
# With this library we can navigate a compressed archive
# without even extracting its contents
import tarfile
import io

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# With the following code we can iterate over the CSVs contained in the compressed file
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
                        header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    ...  # But dataframe is the whole file, which is too big

I tried the answers to "How to create Panda Dataframe from csv that is compressed in tar.gz?" but still can't solve it ...

  • Have you looked at [this](https://docs.python.org/3/library/tarfile.html)? I'm pretty sure there's a way to only decompress specific files at a time. – Steele Farnsworth Jan 04 '22 at 15:02
  • Maybe it answers your question: [https://stackoverflow.com/questions/39263929/how-can-i-read-tar-gz-file-using-pandas-read-csv-with-gzip-compression-option](https://stackoverflow.com/questions/39263929/how-can-i-read-tar-gz-file-using-pandas-read-csv-with-gzip-compression-option) – Sadegh Sh Jan 04 '22 at 15:02
  • You can avoid loading the full file into memory with pandas read_csv using the chunksize parameter, where you specify the number of records you want to load into memory at a time (see the sketch below). – Marcello Chiuminatto Jan 04 '22 at 15:05
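
A minimal sketch of the chunksize idea from the last comment (the file name 'big.csv' is hypothetical): with chunksize set, pd.read_csv returns an iterator of DataFrames rather than a single one, so only that many rows are parsed into memory at a time.

import pandas as pd

# 'big.csv' is a hypothetical standalone CSV, just to illustrate chunksize:
# each `chunk` is a DataFrame of at most 10_000 rows.
for chunk in pd.read_csv('big.csv', header=None, chunksize=10_000):
    print(chunk.shape)  # process each chunk here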

1 Answer


You can actually iterate in chunks over the files inside a compressed archive with the following function:

def generate_individual_df(tar_file, chunksize=10**4):
    return (
        (
            member.name,
            chunk
        )
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(
            io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
            header=None, chunksize=chunksize)
    )
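
Note that .read().decode('ascii') still loads each member's full text into memory before chunking. As a sketch (assuming the members are plain CSV files), you can instead pass the file-like object returned by tar_file.extractfile(member) straight to pd.read_csv, so only chunksize rows are parsed at a time:

import pandas as pd
import tarfile

def generate_individual_df(tar_file, chunksize=10**4):
    # extractfile() returns a file-like object that read_csv can
    # consume lazily, so each member is streamed chunk by chunk
    # instead of being decoded into one big string first.
    return (
        (member.name, chunk)
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(tar_file.extractfile(member),
                                 header=None, chunksize=chunksize)
    )

with tarfile.open(r'\\path\to\the\tar\file.tar.gz') as tar_file:
    for filename, chunk in generate_individual_df(tar_file):
        print(filename, chunk.shape)  # process each chunk here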