I created a test data frame in a pseudo-panel-like format. Obviously, how well your data compress will always depend on the data themselves: if they are literally the same thing repeated over and over, compression ratios will be high; if they never repeat, compression ratios will be low. To get answers for your own data, take a sample with df.sample(10_000) (or something like that), run code like mine below to save it in the different formats, and compare the sizes.
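For example, assuming big_df stands in for your full DataFrame (the name, and the toy data behind it, are mine, not from the question):

```python
import pandas as pd

# Hypothetical stand-in for your real data.
big_df = pd.DataFrame({'x': range(1_000_000)})

# Draw a reproducible 10,000-row sample to benchmark with.
df = big_df.sample(10_000, random_state=0)
```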
import random

import pandas as pd

df = pd.DataFrame({
    'd': range(10_000),
    's': [random.choice(['alpha', 'beta', 'gamma', 'delta'])
          for _ in range(10_000)],
    'i': [random.randint(0, 1000) for _ in range(10_000)]
})
I then queried the file size produced by each of the following save formats.
from os.path import getsize

l = []
for p in ['.csv', '.csv.gz', '.csv.xz', '.csv.bz2', '.csv.zip']:
    df.to_csv('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.pkl', '.pkl.gz', '.pkl.xz', '.pkl.bz2']:
    df.to_pickle('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.xls', '.xlsx']:
    df.to_excel('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.dta', '.dta.gz', '.dta.xz', '.dta.bz2']:
    df.to_stata('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
cr = pd.DataFrame(l)
cr['ratio'] = cr['size'] / cr.loc[0, 'size']  # relative to plain .csv (row 0)
cr.sort_values('ratio', inplace=True)
That yielded the following table:
name size ratio
7 temp.pkl.xz 22532 0.110395
8 temp.pkl.bz2 23752 0.116372
13 temp.dta.xz 39276 0.192431
6 temp.pkl.gz 40619 0.199011
2 temp.csv.xz 42332 0.207404
14 temp.dta.bz2 51694 0.253273
3 temp.csv.bz2 54801 0.268495
12 temp.dta.gz 57513 0.281783
1 temp.csv.gz 70219 0.344035
4 temp.csv.zip 70837 0.347063
11 temp.dta 170912 0.837377
5 temp.pkl 180865 0.886141
0 temp.csv 204104 1.000000
10 temp.xlsx 216828 1.062341
9 temp.xls 711168 3.484341
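As a sanity check on the winner: pandas infers the compression from the file extension on read as well as on write, so the smallest file round-trips back to the original frame with no extra arguments. A minimal sketch:

```python
import random

import pandas as pd

# Same shape of data as the 'i' column in the benchmark above.
df = pd.DataFrame({'i': [random.randint(0, 1000) for _ in range(10_000)]})

df.to_pickle('temp.pkl.xz')               # compression inferred from the .xz suffix
restored = pd.read_pickle('temp.pkl.xz')  # decompression inferred the same way
assert restored.equals(df)
```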
I did not try to_parquet or to_feather because they require the pyarrow dependency, which is non-standard in Anaconda.
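For anyone who does have pyarrow installed, the analogous calls would look like the sketch below; I have not benchmarked these here, and the try/except simply degrades gracefully when the dependency is missing:

```python
from os.path import getsize

import pandas as pd

df = pd.DataFrame({'i': range(1_000)})
try:
    df.to_parquet('temp.parquet')  # needs pyarrow (or fastparquet)
    print('temp.parquet', getsize('temp.parquet'))
    parquet_ok = True
except ImportError:
    parquet_ok = False  # pyarrow/fastparquet not installed
```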
Running the export to Excel 2003's .xls format threw a warning that xlwt is no longer maintained and will be removed. Given that its file size is by far the largest in the table, that is no great loss.