
I am trying to write a dataframe to a gzipped csv in python pandas, using the following:

import pandas as pd
import datetime
import csv
import gzip

# Get data (with previous connection and script variables)
df = pd.read_sql_query(script, conn)

# Create today's date, to append to file
todaysdatestring = str(datetime.datetime.today().strftime('%Y%m%d'))
print todaysdatestring

# Create csv with gzip compression
df.to_csv('foo-%s.csv.gz' % todaysdatestring,
      sep='|',
      header=True,
      index=False,
      quoting=csv.QUOTE_ALL,
      compression='gzip',
      quotechar='"',
      doublequote=True,
      line_terminator='\n')

This just creates a csv called 'foo-YYYYMMDD.csv.gz', not an actual gzip archive.

I've also tried adding this:

#Turn to_csv statement into a variable
d = df.to_csv('foo-%s.csv.gz' % todaysdatestring,
      sep='|',
      header=True,
      index=False,
      quoting=csv.QUOTE_ALL,
      compression='gzip',
      quotechar='"',
      doublequote=True,
      line_terminator='\n')

# Write above variable to gzip
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as output:
    output.write(d)

Which fails as well. Any ideas?

desertnaut
user2752159
    Using `df.to_csv` with `compression='gzip'` produces a gzip archive for me. I used the same keyword arguments as you. What version of pandas are you using? See the output of `pd.__version__` to determine this. It looks like gzip wasn't implemented until 0.17.1, but trying to use it in earlier versions will not produce an error. – root May 12 '16 at 17:14
  • [here](http://stackoverflow.com/a/37012035/5741205) are some __tested__ examples. Actually I would recommend switching to HDF5 - it's much faster and more convenient! – MaxU - stand with Ukraine May 12 '16 at 17:20
  • @root, that was the issue! I ran `easy_install --upgrade pandas` from the command line and upgraded from 16.1 to 18.1, and the "# Create csv with gzip compression" code from the top section worked as expected. Should I edit/delete the main post to reflect this? – user2752159 May 12 '16 at 18:27
  • I'll write my comment up as an answer, and you can accept it. – root May 12 '16 at 19:00

4 Answers


Using `df.to_csv()` with the keyword argument `compression='gzip'` should produce a gzip archive. I tested it using the same keyword arguments as you, and it worked.

You may need to upgrade pandas: gzip compression was not implemented until version 0.17.1, and trying to use it on earlier versions will not raise an error; it silently produces a regular csv. You can determine your current version of pandas from the output of `pd.__version__`.
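To confirm which behaviour you are getting, you can check the version and then verify the output file's gzip signature. A minimal sketch, using a throwaway dataframe and a temp-directory path (both purely illustrative):

```python
import csv
import gzip
import os
import tempfile

import pandas as pd

print(pd.__version__)  # gzip support requires >= 0.17.1

# Throwaway dataframe and filename for illustration
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
path = os.path.join(tempfile.gettempdir(), 'foo-demo.csv.gz')
df.to_csv(path, sep='|', index=False, quoting=csv.QUOTE_ALL, compression='gzip')

# A real gzip archive starts with the magic bytes 1f 8b
with open(path, 'rb') as f:
    magic = f.read(2)

# And gzip.open can read it back as text
with gzip.open(path, 'rt') as f:
    text = f.read()
```

If `magic` is not `b'\x1f\x8b'`, the file is a plain csv despite its `.gz` name, which is the silent-failure symptom described above.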

root
  • @cᴏʟᴅsᴘᴇᴇᴅ Read the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html); if you are working in a Jupyter notebook you can just type `?df.to_csv` – rpanai Mar 08 '18 at 12:24
  • 6
    just want to add a documented limitation that `compression=gzip` works only when the first argument of `.to_csv()` is a file name. it won't work if it's a file object. – shawnzhu Aug 06 '18 at 17:36
  • 3
    Note: since concatenating gunzips produces another valid gunzip, you can repeatedly do `df.to_csv(filename, compression='gzip', mode='a')` to combine dataframes into one big gunzipped file. Useful if your data doesn't fit in memory. – BallpointBen Dec 21 '18 at 21:44
  • 2
    note that the default of `compression` is `'infer'` which means that the compression type is inferred from the file suffix, so you should be fine just appending `.gz` to your filename. – dopexxx Oct 07 '19 at 16:57
  • Trying `df.to_csv(filename, compression='gzip', mode='a')` for the first time, the csv file inside the gzip is just being overwritten; append doesn't seem to work for me. Each row is read into df once when the function is called, so this action should be repeated with append working. – Nagasri Varma Jul 09 '20 at 15:35
  • Also, `.bz2` is now supported: faster, smaller, and with widespread tooling. – dsz Dec 06 '22 at 21:12
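The append trick mentioned in the comments works because concatenated gzip members form a single valid gzip stream, which `gzip.open` reads transparently. A stdlib-only sketch (path and contents are illustrative):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'parts-demo.csv.gz')

# Write a first gzip member, then append a second one
with gzip.open(path, 'wb') as f:
    f.write(b'a|b\n1|2\n')
with gzip.open(path, 'ab') as f:  # 'ab' appends a new gzip member
    f.write(b'3|4\n')

# Reading crosses member boundaries as if it were one stream
with gzip.open(path, 'rb') as f:
    data = f.read()
```

This is the same mechanism `mode='a'` in `to_csv` relies on, so tools like `zcat` and `pd.read_csv` see one continuous file.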

This is done very easily with pandas:

import pandas as pd

Write a pandas dataframe to disk as a gzip-compressed csv:

df.to_csv('dfsavename.csv.gz', compression='gzip')

Read it back from disk:

df = pd.read_csv('dfsavename.csv.gz', compression='gzip')
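A quick round trip confirms that nothing is lost in the write/read cycle. A sketch with an illustrative dataframe and a temp-directory path:

```python
import os
import tempfile

import pandas as pd

# Illustrative data and path
path = os.path.join(tempfile.gettempdir(), 'dfsavename-demo.csv.gz')
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

df.to_csv(path, index=False, compression='gzip')
restored = pd.read_csv(path, compression='gzip')

# Same values and dtypes come back out
same = restored.equals(df)
```

Note that `index=False` is what makes the round trip clean; without it, the saved index comes back as an extra unnamed column.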
Ioannis Nasios

From the gzip module documentation:

import gzip
content = "Lots of content here"
with gzip.open('file.txt.gz', 'wb') as f:
    f.write(content.encode())  # 'wb' mode expects bytes in Python 3

with pandas

import csv
import gzip


content = df.to_csv(
      sep='|',
      header=True,
      index=False,
      quoting=csv.QUOTE_ALL,
      quotechar='"',
      doublequote=True,
      line_terminator='\n')

# 'wt' (text mode) accepts the str in Python 3; in Python 2, 'wb' works
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wt') as f:
    f.write(content)

The trick here is that `to_csv` returns the csv as a string if you don't pass it a filename. You then pass that string to gzip's `write` method.
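A self-contained sketch of this trick, with a throwaway dataframe and path (both illustrative), showing that the no-filename call returns a string and that text-mode gzip round-trips it intact:

```python
import csv
import gzip
import os
import tempfile

import pandas as pd

# Throwaway dataframe for illustration
df = pd.DataFrame({'a': [1], 'b': [2]})

# No path argument -> to_csv returns the csv as a str
content = df.to_csv(sep='|', index=False, quoting=csv.QUOTE_ALL)

path = os.path.join(tempfile.gettempdir(), 'foo-demo.csv.gz')
with gzip.open(path, 'wt') as f:  # text mode, so a str is accepted
    f.write(content)

with gzip.open(path, 'rt') as f:
    round_trip = f.read()
```

This two-step route is handy when you want to post-process the csv text before compressing, but for a plain dump, `compression='gzip'` on `to_csv` itself does the same job in one call.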

piRSquared
  • is this more efficient that simply passing the argument to gzip? I'm currently (as I type this) extracting about 40gb worth of tables from an old dB and decided to just loop through each table and write each iteration out with `compression='gzip'` answer but wondering if I should have used this – Umar.H Sep 05 '19 at 17:27
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wt') as f:  # 'wt' for str in Python 3
    f.write(df.to_csv(sep='|', index=False, quoting=csv.QUOTE_ALL))
Alexander