
I have a zipped directory similar to this one:

folder_to_zip
    - file_1.csv
    - folder_1.zip
        - file_2.csv
        - file_3.csv
        - folder_2.zip
            - file_4.csv
            - file_5.csv
            - file_6.csv
    - file_7.csv

and I would like to load each CSV file into its own pandas DataFrame.

The reason I want to do this is to keep the project from getting too heavy: the zipped folder is just 639 MB instead of 7.66 GB extracted.

Based on these questions (Python: Open file in zip without temporarily extracting it, Python py7zr can't list files in archive - how to read 7z archive without extracting it) I tried something like this:

from py7zr import SevenZipFile as szf
import pandas as pd


def unzip_(folder_to_zip):
    dfs = []
    if folder_to_zip.endswith('.csv'):
        dfs.append(pd.read_csv(folder_to_zip))
    else:
        with szf(folder_to_zip, 'r') as z:
            # getnames() returns member names inside the archive,
            # not paths on disk, so this recursive call cannot open them
            for f in z.getnames():
                dfs += unzip_(f)
    return dfs

  • What is not working with the current approach? It is not clear from the question – Droid Mar 02 '23 at 10:07
  • you're probably going to need a lot more than 8GB RAM to do this, are you expecting that? – Sam Mason Mar 02 '23 at 10:07
  • @SamMason two things: first of all, thank you for the formatting, it is really what I wanted to do but I do not know why it did not work (feel free to give me any suggestions). Secondly, why? I am not going to save anything –  Mar 02 '23 at 10:15
  • pandas only knows about uncompressed CSV data, you'd need to decompress them somewhere (presumably RAM) before it can do anything with them. even if you work one file at a time, pandas needs somewhere to store the loaded data in RAM, and this is often similar in size to the raw CSV data (numbers will likely take less RAM, text will likely take more); a rough way to check this is sketched below – Sam Mason Mar 02 '23 at 10:20
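
To see the comment's point in practice, a loaded frame's real in-memory footprint can be checked directly. A minimal sketch, assuming one of the CSVs has already been pulled out of the archive (the file name is illustrative):

import pandas as pd

df = pd.read_csv("file_1.csv")  # illustrative name, any extracted CSV works
# deep=True counts the actual bytes held by object (string) columns,
# which is where the in-memory size tends to exceed the raw CSV size
print(df.memory_usage(deep=True).sum(), "bytes in RAM")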

1 Answer


If you really want to do this, it would be something like:

import py7zr
import pandas as pd

dfs = {}
with py7zr.SevenZipFile("archive.7z") as ar:
    # pick out just the CSV members, then decompress them into memory;
    # read() returns a dict mapping each name to a file-like object
    targets = [name for name in ar.getnames() if name.endswith(".csv")]
    for name, fd in ar.read(targets).items():
        dfs[name] = pd.read_csv(fd)

Note this loads into a dictionary rather than a list (as I'm not sure how well defined the ordering coming out of getnames() is).

But given the RAM requirements, this seems less useful for your use case.
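
If the nested archives from the question also need handling, one option is to recurse into the in-memory bytes each archive hands back, so nothing is written to disk. This is an untested sketch, not part of the original answer: it assumes the inner files named .zip really are regular zip archives (the question mixes .zip names with py7zr), and it still decompresses everything into RAM:

import io
import zipfile

import py7zr
import pandas as pd


def collect_csvs(src, name, dfs):
    # src is a path or an in-memory file object; name is only used to
    # pick the right opener; results land in the dfs dict
    if name.endswith(".csv"):
        dfs[name] = pd.read_csv(src)
    elif name.endswith(".zip"):
        with zipfile.ZipFile(src) as z:
            for inner in z.namelist():
                collect_csvs(io.BytesIO(z.read(inner)), inner, dfs)
    elif name.endswith(".7z"):
        with py7zr.SevenZipFile(src) as ar:
            for inner, member in ar.read().items():
                collect_csvs(member, inner, dfs)


dfs = {}
collect_csvs("archive.7z", "archive.7z", dfs)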

Sam Mason
  • thank you; actually I have to store all the csv files in a MongoDB collection. Your advice is to extract the content of each file, read it with pandas and then store it in Mongo? My point was: this python script will be executed only one time, so what is the reason to save those files on the drive? There is no problem with RAM, it just seemed smarter to me –  Mar 02 '23 at 10:38
  • if it's only going to be run once and it only takes a few GB then I don't think it matters what you do, does it? if you really cared, then I wouldn't store all loaded dataframes in a list, I'd just insert them straight away so they're not all in memory at one time (a sketch of that follows below) – Sam Mason Mar 02 '23 at 11:14
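
A minimal sketch of that insert-as-you-go approach, assuming pymongo is installed and using hypothetical database and collection names (mydb, csvs); note that py7zr needs reset() between successive read() calls:

import py7zr
import pandas as pd
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["csvs"]  # hypothetical names

with py7zr.SevenZipFile("archive.7z") as ar:
    names = [n for n in ar.getnames() if n.endswith(".csv")]
    for name in names:
        ar.reset()  # rewind the archive so read() can run once per member
        fd = ar.read([name])[name]
        df = pd.read_csv(fd)
        # insert immediately so only one DataFrame is in memory at a time
        collection.insert_many(df.to_dict("records"))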