12

I have a pandas data frame, called df.

I want to save this in a gzipped format. One way to do this is the following:

import gzip
import pandas

df.save('filename.pickle')
f_in = open('filename.pickle', 'rb')
f_out = gzip.open('filename.pickle.gz', 'wb')
f_out.writelines(f_in)
f_in.close()
f_out.close()

However, this requires me to first create a file called filename.pickle. Is there a way to do this more directly, i.e., without creating the filename.pickle?

Loading a gzipped dataframe requires the same intermediate step of creating an uncompressed file. For example, to read filename2.pickle.gz, which is a gzipped pandas dataframe, I know of the following method:

f_in = gzip.open('filename2.pickle.gz', 'rb')
f_out = open('filename2.pickle', 'wb')
f_out.writelines(f_in)
f_in.close()
f_out.close()

df2 = pandas.load('filename2.pickle')

Can this be done without creating filename2.pickle first?

desertnaut
Curious2learn
  • You are mixing the phrase "zipped" and "zipped format" with code that uses gzip, which is not correct. zip and gzip (.gz) are two different, incompatible formats. If you really want the zip format, then gzip code will not do that for you. If you want gzip-formatted data, then call it gzipped, not zipped. – Mark Adler Oct 23 '12 at 14:58
  • I want gzipped. I want to get rid of the intermediate step of creating the non-gzipped file. I have corrected the term used. – Curious2learn Oct 23 '12 at 15:01
  • @Curious2learn the information for this answer has changed. Would you mind reviewing the answers and accepting a new one? – Seanny123 Jun 14 '17 at 11:37

3 Answers

17

Pandas has supported serialization with built-in compression since version 0.20.0. Here is an example of how it can be used:

df.to_csv("my_file.gz", compression="gzip")

For more information, such as different forms of compression available, check out the docs.
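Since the question is about pickles rather than CSV, here is a minimal sketch of the same idea using to_pickle and read_pickle, which also take a compression argument. The helper names save_gzipped and load_gzipped are made up for illustration, and this assumes pandas 0.20.0 or later:

```python
import pandas as pd


def save_gzipped(df, path):
    # Write the pickle straight to a gzip file; no uncompressed copy is created.
    df.to_pickle(path, compression="gzip")


def load_gzipped(path):
    # Read the gzipped pickle back into a DataFrame in one step.
    return pd.read_pickle(path, compression="gzip")
```

Usage would look like save_gzipped(df, 'filename.pickle.gz') followed by df2 = load_gzipped('filename.pickle.gz'). Only unpickle files from sources you trust.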

Brad Solomon
Seanny123
2

For some reason, the Python zlib module can decompress gzip data but cannot compress directly to that format, at least as far as is documented. This is despite the remarkably misleading documentation page header "Compression compatible with gzip".

You can compress to the zlib format instead using zlib.compress or zlib.compressobj, then strip the zlib header and trailer and add a gzip header and trailer, since the zlib and gzip formats wrap the same compressed data format. This gives you data in the gzip format. The zlib header is fixed at two bytes and the trailer at four bytes, so those are easy to strip. Then you can prepend a basic gzip header of ten bytes: "\x1f\x8b\x08\0\0\0\0\0\0\xff" (C string format) and append the eight-byte gzip trailer: a four-byte CRC of the uncompressed data followed by the four-byte uncompressed length modulo 2^32, both in little-endian order. The CRC can be computed using zlib.crc32.
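The rewrapping described above can be sketched roughly as follows. The function name zlib_to_gzip is hypothetical, and the sketch assumes no preset dictionary is used (so the zlib header is exactly two bytes). Note that the gzip trailer defined in RFC 1952 holds the CRC-32 of the uncompressed data followed by the uncompressed length modulo 2^32, so both are written here:

```python
import struct
import zlib

# Minimal ten-byte gzip header: magic, deflate method, no flags,
# zero mtime, no extra flags, OS = 0xff (unknown).
GZIP_HEADER = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"


def zlib_to_gzip(data, level=6):
    """Compress bytes with zlib, then rewrap the deflate stream as gzip."""
    z = zlib.compress(data, level)
    # Drop the 2-byte zlib header and the 4-byte Adler-32 trailer.
    raw_deflate = z[2:-4]
    # gzip trailer: CRC-32 and uncompressed length, both little-endian.
    trailer = struct.pack(
        "<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF
    )
    return GZIP_HEADER + raw_deflate + trailer
```

The result should be readable by any gzip decoder, for example gzip.decompress(zlib_to_gzip(pickle.dumps(df))).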

Community
Mark Adler
  • Thanks, Mark. I actually don't care about the compression format too much as long as it is decent in speed and size (does not have to be the best). If so, can I use zlib directly? I will also look into zlib documentation. Thanks again. – Curious2learn Oct 23 '12 at 15:50
  • Yes, you can just use the zlib format directly with the Python zlib module. It is the same compressor and used by gzip and so compresses the same, minus eight bytes of header. – Mark Adler Oct 23 '12 at 17:23
  • Mark, the only examples I see are where strings are compressed. I don't see an example where pandas dataframe is compressed and saved. Can this be done? Thanks. – Curious2learn Oct 24 '12 at 01:45
  • I don't know what a dataframe is, but you should be able to use pickle.dumps to convert any object into a string. – Mark Adler Oct 24 '12 at 02:03
1

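You can serialize the dataframe to a byte string with pickle.dumps and then write it to disk through gzip:

import gzip
import pickle

# The third argument is the compression level (1 = fastest, 9 = smallest).
file = gzip.GzipFile('filename.pickle.gz', 'wb', 3)
file.write(pickle.dumps(df))
file.close()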
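The question also asks about loading without an intermediate file. A minimal sketch of both directions using only the standard library follows; dump_gz and load_gz are illustrative names, not part of pandas or gzip, and the approach should work for any picklable object, including a dataframe:

```python
import gzip
import pickle


def dump_gz(obj, path, compresslevel=3):
    # Pickle directly into the gzip stream; nothing uncompressed hits the disk.
    with gzip.open(path, "wb", compresslevel=compresslevel) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gz(path):
    # Unpickle directly from the gzip stream.
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

For example, dump_gz(df, 'filename.pickle.gz') and then df2 = load_gz('filename.pickle.gz'). The with blocks close the files even if pickling raises.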
Viacheslav Nefedov