I have a folder with many zip files and within those zip files are multiple csv files. Is there any way to get all of the .csv files in one dataframe in python? Or any way I can pass a list of zip files?

The code I am currently trying is:

import glob
import zipfile
import pandas as pd

for zip_file in glob.glob(r"C:\Users\harsh\Desktop\Temp\data_00-01.zip"):
    # This is just one file. There are multiple zip files in the folder
    zf = zipfile.ZipFile(zip_file)
    dfs = [pd.read_csv(zf.open(f), header=None, sep=";", encoding='latin1') for f in zf.namelist()]
    df = pd.concat(dfs,ignore_index=True)
    print(df)

This code works for one zipfile but I have about 50 zip files in the folder and I would like to read and concatenate all csv files in those zip files in one dataframe.

Thanks

PyNoob
  • You'll need to get the names of all the files in the folder. See here for ways to do that: https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory – Arthur Morris Oct 21 '20 at 08:20

1 Answer

The following code should satisfy your requirements (just edit dir_name according to what you need):

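import os
import zipfile
import pandas as pd

# Folder that contains the zip files; edit as needed.
dir_name = r"C:\Users\harsh\Desktop\Temp"

dfs = []
for filename in os.listdir(dir_name):
    if filename.endswith('.zip'):
        # os.listdir returns bare file names, so join them with the folder path.
        zip_file = os.path.join(dir_name, filename)
        with zipfile.ZipFile(zip_file) as zf:
            dfs += [pd.read_csv(zf.open(f), header=None, sep=";", encoding='latin1') for f in zf.namelist()]
df = pd.concat(dfs, ignore_index=True)
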
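
A slightly more defensive variant of the same loop, as a sketch only: it uses glob to collect every .zip in the folder (glob returns full paths), closes each archive with a `with` block, and skips archive members that are not .csv files. The function name `read_zipped_csvs` and its parameters are made up for illustration; the `header`, `sep`, and `encoding` values are simply copied from the question and may need adjusting for your files.

```python
import glob
import os
import zipfile

import pandas as pd


def read_zipped_csvs(dir_name, sep=";", encoding="latin1"):
    """Read every .csv inside every .zip in dir_name into one DataFrame.

    Hypothetical helper for illustration; not part of pandas or zipfile.
    """
    dfs = []
    # glob returns full paths, so no os.path.join is needed afterwards.
    for zip_path in sorted(glob.glob(os.path.join(dir_name, "*.zip"))):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                # Skip directories and non-CSV members of the archive.
                if not member.lower().endswith(".csv"):
                    continue
                with zf.open(member) as fh:
                    dfs.append(
                        pd.read_csv(fh, header=None, sep=sep, encoding=encoding)
                    )
    if not dfs:
        raise ValueError(f"no .csv files found in any .zip under {dir_name!r}")
    return pd.concat(dfs, ignore_index=True)
```

It could then be called as `df = read_zipped_csvs(r"C:\Users\harsh\Desktop\Temp")`.
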
Yahav Festinger
  • The code is still reading just one zip file. Should I list the zip file names in `dfs = []`? Also, where would I give the path to the folder? – PyNoob Oct 21 '20 at 08:20
  • I replaced `os.listdir(dir_name)` with `os.listdir(r"C:\Users\harsh\Desktop\Temp")` and I am getting this error `FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\harsh\\AppData\\Roaming\\JetBrains\\PyCharmCE2020.2\\scratches\\data_00-01.zip'` . How would I resolve it? – PyNoob Oct 21 '20 at 11:35
  • Great! That works now. I see my mistake. However, now, I am getting a single column with all the values in the dataframe without a header. Is there a way to get the dataframe to format as per column header and respective values? Thanks – PyNoob Oct 21 '20 at 12:51
  • Can you provide the current output, and the expected output? – Yahav Festinger Oct 21 '20 at 13:02
  • Sure! Please find current output [here](http://wikisend.com/download/283564/Current_output.csv) and expected output file [here](http://wikisend.com/download/694092/Expected_output.csv) . The expected output is in the `"C:\Users\harsh\Desktop\Temp\data_19-20.zip"` file – PyNoob Oct 21 '20 at 13:20
  • Please find the entire current output file at: http://wikisend.com/download/653182/Current_output.csv – PyNoob Oct 21 '20 at 13:33
  • Maybe `pd.concat(dfs, ignore_index=True, axis=1)` is what you are looking for? – Yahav Festinger Oct 21 '20 at 13:38
  • How can I use the Expected output as an example and loop all the CSVs in the zip files as per that? Because if I include `axis=1` all the csv files are lined horizontally (in different columns and still all values in 1 column) instead of a vertical concatenation – PyNoob Oct 21 '20 at 13:47
  • It would be much better if you included a link to all the zip files you are using, so I can see the original data – Yahav Festinger Oct 21 '20 at 13:59
  • Sure. I am sorry, I should have done this before I started. Please find the shared folder here: https://drive.google.com/drive/folders/1OdNtJKRyS09Ws_73ovhqKwjj8aftb9mm?usp=sharing – PyNoob Oct 21 '20 at 14:10
  • Could you figure out a way forward? I seem to be stuck here. Thanks – PyNoob Oct 22 '20 at 01:33
  • Your csv files don't all have the same format (some have more columns than others). How do you want to handle this? – Yahav Festinger Oct 22 '20 at 07:03
  • The expected output file / `data_19-20.zip` file has all the columns. How can I use that as a reference and fill in the matching columns from every csv file in the folder? – PyNoob Oct 23 '20 at 11:40
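
One possible way to approach the last question in this thread, sketched under assumptions the thread does not confirm: each CSV is read with its own first row as the header so that `pd.concat` can align columns by name (columns missing from a file become NaN), and the result is then reindexed to the column order of the first CSV in a reference zip. The single-column output mentioned above may indicate that the files are not actually `;`-separated, so the sketch defaults to `sep=","`; that default, the presence of a header row in every file, and the names `read_member_frames`, `read_with_reference`, and `reference_zip` are all assumptions made for illustration.

```python
import glob
import os
import zipfile

import pandas as pd


def read_member_frames(zip_path, sep=",", encoding="latin1"):
    """Return one DataFrame per .csv member, using each file's first row as header."""
    frames = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            if member.lower().endswith(".csv"):
                with zf.open(member) as fh:
                    frames.append(
                        pd.read_csv(fh, header=0, sep=sep, encoding=encoding)
                    )
    return frames


def read_with_reference(dir_name, reference_zip, sep=",", encoding="latin1"):
    """Concatenate all CSVs, ordering columns like the first CSV in reference_zip.

    Hypothetical helper; assumes every CSV has a header row.
    """
    ref_frames = read_member_frames(
        os.path.join(dir_name, reference_zip), sep, encoding
    )
    if not ref_frames:
        raise ValueError(f"no .csv files found in {reference_zip!r}")
    reference_columns = list(ref_frames[0].columns)

    frames = []
    for zip_path in sorted(glob.glob(os.path.join(dir_name, "*.zip"))):
        frames.extend(read_member_frames(zip_path, sep, encoding))

    # concat aligns on column names; columns missing from a file become NaN.
    combined = pd.concat(frames, ignore_index=True, sort=False)
    # Keep only the reference columns, in the reference order.
    return combined.reindex(columns=reference_columns)
```

With the folder from the question, a call might look like `df = read_with_reference(r"C:\Users\harsh\Desktop\Temp", "data_19-20.zip")`. Note that columns which appear in other files but not in the reference file are dropped by the final `reindex`.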