Pickling pandas dataframe multiplies by 5 the file size

Question

I am reading an 800 MB CSV file with pandas.read_csv, and then saving it with the standard Python pickle.dump(dataframe). The result is a 4 GB pkl file, so the CSV size is multiplied by 5.

I expected pickle to compress the data rather than expand it, especially since running gzip on the CSV file compresses it to 200 MB, dividing its size by 4.

I want to speed up the loading time of my program and thought that pickling would help. Since disk access is the main bottleneck, I now think I should instead compress the files and use the compression option of pandas.read_csv to speed up loading.

Is that correct?

Is it normal that pickling a pandas dataframe increases the data size?

How do you usually speed up loading time?

What is the largest data size you would load with pandas?

score 4 · Answer 1 · answered May 15 '15 at 08:20

Not sure why you think pickling compresses the data. Pickling serializes your Python object to a byte stream so that it can be loaded back as a Python object; it does not compress anything:

In [388]:

import os
import sys

import numpy as np
import pandas as pd

df = pd.DataFrame({'a':np.arange(5)})
df.to_pickle(r'c:\data\df.pkl')
print(sys.getsizeof(df))
statinfo = os.stat(r'c:\data\df.pkl')
print(statinfo.st_size)
with open(r'c:\data\df.pkl', 'rb') as f:
    print(f.read())
56
700
b'\x80\x04\x95\xb1\x02\x00\x00\x00\x00\x00\x00\x8c\x11pandas.core.frame\x94\x8c\tDataFrame\x94\x93\x94)}\x94\x92\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)}\x94\x92\x94(]\x94(\x8c\x11pandas.core.index\x94\x8c\n_new_Index\x94\x93\x94h\x0b\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94\x8c\x01a\x94at\x94b\x8c\x04name\x94Nu\x86\x94R\x94h\rh\x0b\x8c\nInt64Index\x94\x93\x94}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x05\x85\x94h\x1f\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C(\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x94t\x94bh(Nu\x86\x94R\x94e]\x94h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01K\x05\x86\x94h\x1f\x8c\x02i4\x94K\x00K\x01\x87\x94R\x94(K\x03h5NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x14\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x94t\x94ba]\x94h\rh\x0f}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01\x85\x94h"\x89]\x94h&at\x94bh(Nu\x86\x94R\x94a}\x94\x8c\x060.14.1\x94}\x94(\x8c\x06blocks\x94]\x94}\x94(\x8c\x06values\x94h>\x8c\x08mgr_locs\x94\x8c\x08builtins\x94\x8c\x05slice\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94ua\x8c\x04axes\x94h\nust\x94bb.'
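To see why a pickle can end up larger than the CSV it came from, here is a minimal sketch. It assumes a made-up column of single-digit integers (the column name and row count are illustrative, not taken from the question). Each value costs about two characters in the CSV but a fixed 8 bytes as int64 in the pickle:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 single-digit integers stored as int64.
values = np.random.randint(0, 10, size=100_000).astype("int64")
df = pd.DataFrame({"a": values})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "df.csv")
    pkl_path = os.path.join(tmp, "df.pkl")

    df.to_csv(csv_path, index=False)  # text: one digit plus a line ending per row
    df.to_pickle(pkl_path)            # binary: 8 bytes per int64 value, uncompressed

    csv_size = os.path.getsize(csv_path)
    pkl_size = os.path.getsize(pkl_path)

print("CSV bytes:      ", csv_size)
print("pickle bytes:   ", pkl_size)
print("in-memory bytes:", int(df.memory_usage(deep=True).sum()))
```

The pickle size tracks the in-memory size of the frame, not the size of the text it was parsed from. Columns of Python strings (object dtype) tend to widen the gap further.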

The to_csv method does support compression as a keyword argument, with 'gzip' and 'bz2' as options:

In [390]:

# the file is bz2-compressed, so give it a matching extension rather than .zip
df.to_csv(r'c:\data\df.csv.bz2', compression='bz2')
statinfo = os.stat(r'c:\data\df.csv.bz2')
print(statinfo.st_size)
29
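As a rough sketch of the round trip the question asks about, the following writes the same frame as a plain and a gzip-compressed CSV and reads the compressed one back with read_csv. The file names and the sample data are assumptions for illustration only:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample frame standing in for the real CSV contents.
df = pd.DataFrame({"a": np.arange(100_000, dtype="int64")})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "df.csv")
    gz_path = os.path.join(tmp, "df.csv.gz")

    df.to_csv(plain_path, index=False)
    df.to_csv(gz_path, index=False, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    # read_csv decompresses on the fly while parsing
    restored = pd.read_csv(gz_path, compression="gzip")

print("plain CSV bytes:", plain_size)
print("gzip CSV bytes: ", gz_size)
print("round trip ok:  ", restored.equals(df))
```

Whether the compressed file actually loads faster depends on how disk throughput compares with decompression cost on your machine, so it is worth timing both on your own data.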

I just think that trying to compress the data should be the default behaviour of any saving algorithm. It seems I am wrong. – Romain Jouin, May 15 '15 at 09:00
@romainjouin but that would assume that you can always open a compressed file; there may be a system that can't decompress it, whilst a plain text csv is readable on most systems – EdChum, May 15 '15 at 09:46

score 2 · Accepted Answer · edited May 23 '17 at 12:19

It is likely in your best interest to stash your CSV file in a database of some sort and perform operations on that rather than loading the CSV file into RAM, as Kathirmani suggested. You will see the speedup in loading time that you expect, simply because you are no longer filling up 800 MB worth of RAM every time you run your script.

File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the CSV file and loading that will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.

Consider a precursory step to ship the data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
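One way to do that import with pandas itself is to read the CSV in chunks and append each chunk to a table, so the whole file never sits in memory at once. This is only a sketch: the table name foo and column bar echo the query used in this answer, while the file names, chunk size, and stand-in data are hypothetical:

```python
import os
import sqlite3
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")  # hypothetical CSV file
    db_path = os.path.join(tmp, "data.db")    # hypothetical database file

    # Small stand-in for the real 800 MB CSV.
    pd.DataFrame(
        {"bar": ["FOOBAR", "OTHER"] * 500, "value": range(1000)}
    ).to_csv(csv_path, index=False)

    conn = sqlite3.connect(db_path)
    try:
        # Stream the CSV into the table a few rows at a time.
        for chunk in pd.read_csv(csv_path, chunksize=250):
            chunk.to_sql("foo", conn, if_exists="append", index=False)

        row_count = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
        subset = pd.read_sql_query(
            "SELECT * FROM foo WHERE bar = 'FOOBAR'", conn
        )
    finally:
        conn.close()

print("rows imported:", row_count)
print("rows in subset:", len(subset))
```

For a real file you would pick a much larger chunksize; the small value here just keeps the example quick.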

You now have the pleasure of being able to query a subset of your data and quickly load it into a pandas.DataFrame for further use, as follows:

import sqlite3

import pandas as pd

conn = sqlite3.connect('your/database/path')
query = "SELECT * FROM foo WHERE bar = 'FOOBAR';"

# pandas.io.sql.read_frame has been removed from pandas; read_sql_query replaces it
results_df = pd.read_sql_query(query, con=conn)
...

Conversely, you can use pandas.DataFrame.to_sql() to save these for later use.

The time I was hoping to save by zipping the file was the disk read time. I think the 'on-the-fly' decompression would be done in memory and so be faster than reading the data from disk. – Romain Jouin, May 15 '15 at 12:19

score 0 · Answer 3 · answered May 15 '15 at 07:42

Don't load an 800 MB file into memory; it will increase your loading time. Pickled objects also take more time to load. Instead, store the CSV file as a sqlite3 table (sqlite3 ships with Python), and then query the table each time according to your needs.

Kathirmani Sukumar


I am trying to use pandas to do data analysis. Are you suggesting pandas is not tailored to handle big data? – Romain Jouin May 15 '15 at 07:46
Not exactly. Using only pandas, you can query or filter the sqlite3 table directly. Storing the data in RAM will consume your RAM space. What if your data grows? – Kathirmani Sukumar May 15 '15 at 07:49
I have to say that the data doesn't have to grow to already be a problem for my small 8 GB :-s – Romain Jouin May 15 '15 at 12:52

score 0 · Answer 4 · answered Sep 15 '17 at 11:18

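You can also use pandas' pickle methods. In recent pandas versions (0.20 and later, as far as I know) they can compress your data if the filename ends in a compression extension such as .gz or .bz2; with a plain .pkl filename the output is not compressed.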

Save a dataframe:

df.to_pickle(filename)

Load it:

df = pd.read_pickle(filename)
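A minimal sketch of the size difference, assuming a pandas version whose to_pickle infers compression from the file extension (the file names and the repetitive sample data are made up for illustration):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical, highly repetitive data so the effect of gzip is easy to see.
df = pd.DataFrame({"a": (np.arange(100_000) % 10).astype("int64")})

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "df.pkl")
    gz_path = os.path.join(tmp, "df.pkl.gz")

    df.to_pickle(raw_path)  # plain extension: no compression
    df.to_pickle(gz_path)   # .gz extension: gzip is inferred from the name

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # read_pickle infers the compression from the extension as well
    restored = pd.read_pickle(gz_path)

print("plain pickle bytes:", raw_size)
print("gzip pickle bytes: ", gz_size)
print("round trip ok:     ", restored.equals(df))
```

How much you gain depends heavily on the data; random floating-point columns compress far less than repetitive integers or strings.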

johannesmik

