How can I read tar.gz file using pandas read_csv with gzip compression option?

Question

I have a very simple CSV file with the following data, compressed inside a tar.gz archive. I need to read it into a DataFrame using pandas.read_csv.

   A  B
0  1  4
1  2  5
2  3  6

import pandas as pd
pd.read_csv("sample.tar.gz",compression='gzip')

However, I get this error:

CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2

Here are the read_csv calls I tried and the error each one produces:

pd.read_csv("sample.tar.gz",compression='gzip',  engine='python')
Error: line contains NULL byte

pd.read_csv("sample.tar.gz",compression='gzip', header=0)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2

pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ")
CParserError: Error tokenizing data. C error: Expected 2 fields in line 94, saw 14    

pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ", engine='python')
Error: line contains NULL byte

What's going wrong here? How can I fix this?

Ok, so what should I do to read the tar.gz file without unzipping it? — Geet, Sep 01 '16 at 06:30
If it is a single file, why are you `tar`-ing it? Why not just `gzip` it? That way you can use pd.read_csv() on it directly. — Nehal J Wani, Sep 01 '16 at 06:32
I am not tar-ing it. It's given, and I can't unzip the original file as it's more than 100 GB. — Geet, Sep 01 '16 at 06:36
The actual file is here... https://ghtstorage.blob.core.windows.net/downloads/mysql-2016-07-19.tar.gz — Geet, Sep 01 '16 at 06:37
If you manually unzip/untar and try to read the actual CSV file, does it work? — BrenBarn, Sep 01 '16 at 07:02
Yes, that works. But, I need to do it through Python program! — Geet, Sep 01 '16 at 07:03
@Geet: No, I mean if you unzip/untar it and try to use `read_csv` on the actual CSV file, instead of trying to have pandas do the unzipping. — BrenBarn, Sep 01 '16 at 07:07

Marlon Abeykoon · Accepted Answer · 2016-09-01T07:13:20.767

84

df = pd.read_csv('sample.tar.gz', compression='gzip', header=0, sep=' ', quotechar='"', error_bad_lines=False)

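Note: error_bad_lines=False makes pandas skip the offending rows instead of raising an error, so the result may be missing data. In pandas 1.3 and later the equivalent option is on_bad_lines='skip'; error_bad_lines was removed in pandas 2.0.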
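For context, here is a minimal sketch of why the parser sees "bad" lines in the first place. compression='gzip' undoes only the gzip layer, so what reaches the parser is a raw tar stream (header blocks padded with NUL bytes, then the file contents) rather than the CSV alone. The snippet builds a throwaway archive and inspects the start of the decompressed stream; the file names and sample data are made up for illustration and are not the asker's real file.

```python
import gzip
import io
import tarfile
import tempfile
from pathlib import Path

# Build a tiny throwaway archive; names and data are illustrative only.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "sample.tar.gz"
payload = b"A B\n1 4\n2 5\n3 6\n"

with tarfile.open(archive, "w:gz") as tar:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Undo only the gzip layer, which is all that compression='gzip' does.
with gzip.open(archive, "rb") as fh:
    header = fh.read(512)         # first tar header block
    body = fh.read(len(payload))  # member data follows the header

print(header[:10])        # the stream starts with the member name
print(b"\x00" in header)  # the header block is padded with NUL bytes
print(body)               # the CSV text only begins after the header
```

This is consistent with the "line contains NULL byte" error from the Python engine and with the field-count errors from the C engine: both are parsing tar metadata as if it were CSV rows.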

edited Sep 01 '16 at 07:13

answered Sep 01 '16 at 06:30


Thanks, Marlon. What's ".dat" in 3rd line, here? – Geet Sep 01 '16 at 06:34
When I try that, it says, KeyError: "filename 'sample.dat' not found" – Geet Sep 01 '16 at 06:41
@Geet and also tell me your pandas version. This should work for 0.18.1 – Marlon Abeykoon Sep 01 '16 at 06:45
My pandas version is 0.18.1. The updated code gives me a "CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2" error – Geet Sep 01 '16 at 06:53
This worked for me for a sample csv file. Your link makes me download 40GB. Don't you have a sample of it for me to test? – Marlon Abeykoon Sep 01 '16 at 06:58
Check the code: I set `error_bad_lines=False`, and I just noticed your sep is an empty string. Can you try again with the updated answer? – Marlon Abeykoon Sep 01 '16 at 07:02
Did it actually read all the data correctly? Using `error_bad_lines` will just cause it to skip lines with errors, so the result may be missing some rows. Those errors might indicate an actual error in the data file. – BrenBarn Sep 01 '16 at 07:20
@BrenBarn: You are right, I see too many lines being skipped with that option. Any solution for that? – Geet Sep 01 '16 at 07:23
@Geet: I suspect there is some corruption in the file, but it could be hard to pinpoint it if the file is 40G. – BrenBarn Sep 01 '16 at 07:27
I have tested with 2-3 much smaller tar.gz files, but still facing the same issue. – Geet Sep 01 '16 at 07:30
@Geet: It would be good if you could supply a reasonably-sized sample file (like a few KB) to test with. – BrenBarn Sep 05 '16 at 22:52
@BrenBarn: I do have a 1-2 KB file, but how can I supply it here? – Geet Sep 06 '16 at 23:54
@Geet: You'll have to upload it somewhere and provide a link, as you did with your original file. – BrenBarn Sep 07 '16 at 02:48
For pandas 2.0 it will be like: `df = pd.read_csv(r"path_to_csv.gz_file", compression="gzip", header=0, sep=",", quotechar='"', on_bad_lines="skip")` – Arun Apr 12 '23 at 10:06

score 14 · Answer 2 · answered May 30 '19 at 17:52

You can use the tarfile module to read a particular file from the tar.gz archive (as discussed in this resolved issue). If there is only one file in the archive, then you can do this:

import tarfile
import pandas as pd
with tarfile.open("sample.tar.gz", "r:*") as tar:
    csv_path = tar.getnames()[0]
    df = pd.read_csv(tar.extractfile(csv_path), header=0, sep=" ")

The read mode r:* detects the compression (gzip or otherwise) automatically. If the archive contains multiple files, you could use something like csv_path = [n for n in tar.getnames() if n.endswith('.csv')][-1] to pick the last CSV file in the archive.
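As a hedged sketch of the multi-file case, the snippet below builds a small throwaway archive with two members, picks the last member whose name ends in .csv, and reads it in chunks so the whole table never has to sit in memory at once. The member names, sample data, and chunk size are assumptions for illustration; on a very large archive, listing the members is itself slow because the gzip stream has to be scanned to find them.

```python
import io
import tarfile
import tempfile
from pathlib import Path

import pandas as pd

# Build a small throwaway archive with two members (names are made up).
workdir = Path(tempfile.mkdtemp())
archive = workdir / "sample.tar.gz"
members = {
    "dump/readme.txt": b"not a csv\n",
    "dump/sample.csv": b"A B\n1 4\n2 5\n3 6\n",
}
with tarfile.open(archive, "w:gz") as tar:
    for name, data in members.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

total_rows = 0
total_b = 0
with tarfile.open(archive, "r:*") as tar:
    # Keep only regular files whose name ends in .csv, then take the last one.
    csv_members = [m for m in tar.getmembers()
                   if m.isfile() and m.name.endswith(".csv")]
    member = csv_members[-1]
    # chunksize makes read_csv yield DataFrames of at most 2 rows each.
    for chunk in pd.read_csv(tar.extractfile(member), sep=" ", header=0,
                             chunksize=2):
        total_rows += len(chunk)
        total_b += chunk["B"].sum()

print(total_rows, total_b)
```

Processing each chunk as it arrives (here, a running row count and column sum) is one way to handle a file that is too large to extract or load whole; the per-chunk work would be replaced by whatever aggregation or filtering is actually needed.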

teichert


Isn't `r:*` (or equivalently `r`) the default? I don't see what benefit it has to specify it explicitly. – Asclepius Feb 22 '21 at 23:16
@Asclepius explicit is better than implicit -zen of python – tmthyjames Sep 25 '22 at 11:58
@tmthyjames Maybe you would like to program in C instead where everything is as explicit as it can be. – Asclepius Sep 25 '22 at 15:15
@Asclepius i can barely code in python! :) – tmthyjames Sep 30 '22 at 13:41
