I created a test data frame in a pseudo-panel-like format. Obviously, how well your data compress will always depend on the data themselves: if they are literally the same thing repeated over and over, compression ratios will be high; if they never repeat, compression ratios will be low. To get answers for your own data, take a sample with df.sample(10_000) (or something like that), run code like mine below to save it in the different formats, and compare the sizes.
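For example, assuming big_df stands in for your full DataFrame (the name, and the toy data behind it, are mine, not from the question):

```python
import pandas as pd

# Hypothetical stand-in for your real data.
big_df = pd.DataFrame({'x': range(1_000_000)})

# Draw a reproducible 10,000-row sample to benchmark with.
df = big_df.sample(10_000, random_state=0)
```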
import random

import pandas as pd

df = pd.DataFrame({
    'd': range(10_000),
    's': [random.choice(['alpha', 'beta', 'gamma', 'delta'])
          for _ in range(10_000)],
    'i': [random.randint(0, 1000) for _ in range(10_000)]
})
I then queried the file size produced by each of the following save formats.
from os.path import getsize

l = []
for p in ['.csv', '.csv.gz', '.csv.xz', '.csv.bz2', '.csv.zip']:
    df.to_csv('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.pkl', '.pkl.gz', '.pkl.xz', '.pkl.bz2']:
    df.to_pickle('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.xls', '.xlsx']:
    df.to_excel('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.dta', '.dta.gz', '.dta.xz', '.dta.bz2']:
    df.to_stata('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
cr = pd.DataFrame(l)
cr['ratio'] = cr['size'] / cr.loc[0, 'size']  # relative to plain .csv (row 0)
cr.sort_values('ratio', inplace=True)
That yielded the following table:
name size ratio
7 temp.pkl.xz 22532 0.110395
8 temp.pkl.bz2 23752 0.116372
13 temp.dta.xz 39276 0.192431
6 temp.pkl.gz 40619 0.199011
2 temp.csv.xz 42332 0.207404
14 temp.dta.bz2 51694 0.253273
3 temp.csv.bz2 54801 0.268495
12 temp.dta.gz 57513 0.281783
1 temp.csv.gz 70219 0.344035
4 temp.csv.zip 70837 0.347063
11 temp.dta 170912 0.837377
5 temp.pkl 180865 0.886141
0 temp.csv 204104 1.000000
10 temp.xlsx 216828 1.062341
9 temp.xls 711168 3.484341
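As a sanity check on the winner: pandas infers the compression from the file extension on read as well as on write, so the smallest file round-trips back to the original frame with no extra arguments. A minimal sketch:

```python
import random

import pandas as pd

# Same shape of data as the 'i' column in the benchmark above.
df = pd.DataFrame({'i': [random.randint(0, 1000) for _ in range(10_000)]})

df.to_pickle('temp.pkl.xz')               # compression inferred from the .xz suffix
restored = pd.read_pickle('temp.pkl.xz')  # decompression inferred the same way
assert restored.equals(df)
```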
I did not try to_parquet or to_feather because they require the pyarrow dependency, which is non-standard in Anaconda.
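For anyone who does have pyarrow installed, the analogous calls would look like the sketch below; I have not benchmarked these here, and the try/except simply degrades gracefully when the dependency is missing:

```python
from os.path import getsize

import pandas as pd

df = pd.DataFrame({'i': range(1_000)})
try:
    df.to_parquet('temp.parquet')  # needs pyarrow (or fastparquet)
    print('temp.parquet', getsize('temp.parquet'))
    parquet_ok = True
except ImportError:
    parquet_ok = False  # pyarrow/fastparquet not installed
```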
Running the export to Excel 2003's .xls format threw a warning that xlwt is no longer maintained and will be removed. Given that its file size is by far the largest in the table, that is no great loss.