57

I'm trying to read data from a zipped CSV file. Is there a way to do this without unzipping the whole file? If not, how can I unzip the file and read it efficiently?

Burhan Ali
Elyza Agosta
  • See my answer here [without downloading zip files] https://stackoverflow.com/a/45771620/348168 – Vinod Aug 19 '17 at 12:51

10 Answers

96

I used the zipfile module to read a CSV from a ZIP archive directly into a pandas DataFrame. Let's say the file is named "intfile.csv" and it is inside an archive named "THEZIPFILE.zip":

import pandas as pd
import zipfile

zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip') 
df = pd.read_csv(zf.open('intfile.csv'))
ZygD
Yaron
51

If you aren't using pandas, it can be done entirely with the standard library. Here is Python 3.7 code:

import csv
from io import TextIOWrapper
from zipfile import ZipFile

with ZipFile('yourfile.zip') as zf:
    with zf.open('your_csv_inside_zip.csv', 'r') as infile:
        # newline='' lets the csv module handle line endings itself
        reader = csv.reader(TextIOWrapper(infile, encoding='utf-8', newline=''))
        for row in reader:
            # process the CSV here
            print(row)
Community
volker238
  • 5
    I tried doing this not realizing that I needed io.TextIOWrapper. How could I have known? – Ken Ingram Jul 21 '20 at 12:14
  • 1
    @KenIngram ZipFile.open() gives you a zipfile.ZipExtFile object, which is a binary stream. The built-in open() function, in its default text mode, returns an _io.TextIOWrapper object directly – Dimitri_Fu Jul 14 '21 at 19:07
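
To make the point in the comment above concrete, here is a minimal sketch. It builds a small archive in memory (the member name `demo.csv` and its contents are made up for illustration) and shows that `ZipFile.open()` hands back bytes, while wrapping the stream in `TextIOWrapper` gives the decoded text that `csv.reader` expects.

```python
import csv
import io
import zipfile

# Build a tiny ZIP archive in memory; "demo.csv" and its rows are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo.csv", "name,value\nalpha,1\nbeta,2\n")

with zipfile.ZipFile(buffer) as zf:
    # ZipFile.open() returns a binary file object, so lines come back as bytes.
    with zf.open("demo.csv") as raw:
        first_line = raw.readline()

    # Wrapping the binary stream decodes it to str, which csv.reader requires.
    with zf.open("demo.csv") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        rows = list(csv.reader(text))

print(type(first_line).__name__)
print(rows)
```

Passing the undecoded stream straight to `csv.reader` is what fails in Python 3, because the reader expects its iterator to yield strings rather than bytes.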
33

A quick solution is the code below:

import pandas as pd

# pandas can read a zipped CSV file directly
df = pd.read_csv("/path/to/file.csv.zip")
Hari Prasad
  • 1
    Outstanding answer! I checked that the same solution also works without the ".csv" extension: `df = pd.read_csv("/path/to/file.zip")` – Gian Arauz Mar 04 '21 at 14:46
10

zipfile also supports the with statement.

So, adding onto Yaron's answer of using pandas:

import pandas as pd
import zipfile

with zipfile.ZipFile('file.zip') as myZip:
    with myZip.open('file.csv') as myZipCsv:
        df = pd.read_csv(myZipCsv)
kcrk
gaius_baltar
9

I thought Yaron had the best answer, but I wanted to add code that iterates through multiple files inside a zip archive and concatenates the results into a single DataFrame:

import os
import pandas as pd
import zipfile

curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
list_ = []

print("Uncompressing and reading data... ")

for text_file in zf.infolist():
    # skip directory entries, which have no data to parse
    if text_file.is_dir():
        continue
    print(text_file.filename)
    df = pd.read_csv(zf.open(text_file.filename))
    # do df manipulations
    list_.append(df)

# combine the per-file frames into one DataFrame
df = pd.concat(list_)
Koedlt
Arthur D. Howland
5

Yes. You want the zipfile module.

You open the zip file itself with zipfile.ZipFile(file[, mode]). (zipfile.ZipInfo only describes a single member of an archive; it does not open the archive.)

You can then use ZipFile.infolist() to enumerate each file within the zip, and read it with ZipFile.open(name[, mode[, pwd]])
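
The steps above can be sketched with the standard library alone. The function name `read_csv_members`, the `.csv` filter, the UTF-8 encoding, and the in-memory demo archive are all assumptions made for illustration, not part of the answer.

```python
import csv
import io
import zipfile


def read_csv_members(archive):
    """Return {member name: list of rows} for every .csv entry in the archive.

    `archive` may be a path or a binary file-like object.
    """
    tables = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directory entries and anything that is not a CSV file.
            if info.is_dir() or not info.filename.lower().endswith(".csv"):
                continue
            # ZipFile.open() accepts either a member name or a ZipInfo object.
            with zf.open(info) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[info.filename] = list(csv.reader(text))
    return tables


# Hypothetical in-memory archive used only to exercise the function.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("data/a.csv", "x,y\n1,2\n")
    zf.writestr("data/notes.txt", "not a csv")
    zf.writestr("b.csv", "x,y\n3,4\n")

tables = read_csv_members(demo)
print(sorted(tables))
```

Only the members you open are decompressed, so the rest of the archive is never extracted to disk.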

brycem
5

This is the simplest approach, and the one I always use.

import pandas as pd
df = pd.read_csv("Train.zip", compression='zip')
SHR
4

Suppose you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is a sample implementation:

#!/usr/bin/env python3

from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile

import requests

def all_tickers():
    url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
    r = requests.get(url)
    zip_ref = ZipFile(BytesIO(r.content))
    for name in zip_ref.namelist():
        print(name)
        with zip_ref.open(name) as file_contents:
            reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
            for item in reader:
                print(item)

This takes care of all the Python 3 bytes/str issues.
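
The same pattern can be exercised without a network call by separating the download from the parsing. The sketch below is one possible arrangement, not the answer's own code: `rows_from_zip_bytes` is a hypothetical helper, and the archive contents are made up to stand in for the downloaded bytes (`r.content` in the code above).

```python
from csv import DictReader
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def rows_from_zip_bytes(payload, delimiter=","):
    """Yield (member name, row dict) pairs from ZIP data held in memory."""
    with ZipFile(BytesIO(payload)) as zip_ref:
        for name in zip_ref.namelist():
            with zip_ref.open(name) as file_contents:
                text = TextIOWrapper(file_contents, encoding="utf-8", newline="")
                for row in DictReader(text, delimiter=delimiter):
                    yield name, row


# Stand-in for the downloaded response body; the data below is made up.
fake_download = BytesIO()
with ZipFile(fake_download, "w") as zf:
    zf.writestr("industries.csv", "IndustryId;Sector\n100;Industrials\n")

results = list(rows_from_zip_bytes(fake_download.getvalue(), delimiter=";"))
print(results)
```

Because the helper is a generator, rows are decoded one at a time as the caller consumes them, although the compressed payload itself still sits in memory.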

hughdbrown
3

Since version 0.18.1, pandas natively supports compressed CSV files: its read_csv function has a compression parameter that accepts {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Anatoly Alekseev
-1

If you have a file named my_big_file.csv and you zip it under the same base name, as my_big_file.zip,

you can simply do this:

df = pd.read_csv("my_big_file.zip")

Note: check your pandas version first (this does not work in older versions)
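
One caveat worth checking: the pandas documentation states that a zip archive passed to read_csv must contain only one data file. A small standard-library check can make that condition explicit before calling read_csv. This is only a sketch; `sole_member_name` is a hypothetical helper and the two demo archives are made up.

```python
import io
import zipfile


def sole_member_name(archive):
    """Return the name of the only file in the archive, or raise ValueError."""
    with zipfile.ZipFile(archive) as zf:
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
    if len(names) != 1:
        raise ValueError(
            f"expected exactly one file in the archive, found {len(names)}: {names}"
        )
    return names[0]


# Hypothetical archives: one with a single CSV, one with two.
one = io.BytesIO()
with zipfile.ZipFile(one, "w") as zf:
    zf.writestr("my_big_file.csv", "a,b\n1,2\n")

two = io.BytesIO()
with zipfile.ZipFile(two, "w") as zf:
    zf.writestr("first.csv", "a\n1\n")
    zf.writestr("second.csv", "a\n2\n")

print(sole_member_name(one))
try:
    sole_member_name(two)
except ValueError as exc:
    print("archive needs per-member handling:", exc)
```

When the check fails, opening each member with ZipFile.open() and passing the resulting file object to read_csv is the usual fallback.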


adhg