
I have a huge compressed file from which I want to read the individual dataframes one by one, so as not to run out of memory.

Also, due to time and space constraints, I can't extract the .tar.gz.

This is the code I've got so far:

import pandas as pd
# With this library we can navigate a compressed archive
# without even extracting its contents
import tarfile
import io

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# With the following code we can iterate over the CSVs contained in the compressed file
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
                        header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    ...  # But dataframe is the whole file, which is too big

I tried the answers to "How to create Panda Dataframe from csv that is compressed in tar.gz?" but still can't solve it ...

  • Have you looked at [this](https://docs.python.org/3/library/tarfile.html)? I'm pretty sure there's a way to only decompress specific files at a time. – Steele Farnsworth Jan 04 '22 at 15:02
  • Maybe it answers your question: [https://stackoverflow.com/questions/39263929/how-can-i-read-tar-gz-file-using-pandas-read-csv-with-gzip-compression-option](https://stackoverflow.com/questions/39263929/how-can-i-read-tar-gz-file-using-pandas-read-csv-with-gzip-compression-option) – Sadegh Sh Jan 04 '22 at 15:02
  • You can avoid loading the full file into memory with pandas read_csv using the chunksize parameter, where you specify the number of records you want to load into memory at a time (see the sketch below). – Marcello Chiuminatto Jan 04 '22 at 15:05
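
A minimal sketch of the chunksize idea from the last comment (the file name 'big.csv' is hypothetical): with chunksize set, pd.read_csv returns an iterator of DataFrames rather than a single one, so only that many rows are parsed into memory at a time.

import pandas as pd

# 'big.csv' is a hypothetical standalone CSV, just to illustrate chunksize:
# each `chunk` is a DataFrame of at most 10_000 rows.
for chunk in pd.read_csv('big.csv', header=None, chunksize=10_000):
    print(chunk.shape)  # process each chunk here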

1 Answer


You can actually iterate in chunks over the files inside a compressed archive with the following function:

def generate_individual_df(tar_file, chunksize=10**4):
    return (
        (
            member.name,
            chunk
        )
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(
            io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
            header=None, chunksize=chunksize)
    )
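
Note that .read().decode('ascii') still loads each member's full text into memory before chunking. As a sketch (assuming the members are plain CSV files), you can instead pass the file-like object returned by tar_file.extractfile(member) straight to pd.read_csv, so only chunksize rows are parsed at a time:

import pandas as pd
import tarfile

def generate_individual_df(tar_file, chunksize=10**4):
    # extractfile() returns a file-like object that read_csv can
    # consume lazily, so each member is streamed chunk by chunk
    # instead of being decoded into one big string first.
    return (
        (member.name, chunk)
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(tar_file.extractfile(member),
                                 header=None, chunksize=chunksize)
    )

with tarfile.open(r'\\path\to\the\tar\file.tar.gz') as tar_file:
    for filename, chunk in generate_individual_df(tar_file):
        print(filename, chunk.shape)  # process each chunk here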