I have a dataframe of 100,000+ rows, each with 100,000 columns, totaling roughly 10,000,000,000 float values. The floats in `huge.csv` are stored as strings in a tab-separated file, which is 125 GB on disk.
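For scale, a back-of-the-envelope estimate (my own arithmetic, not a measurement) of what the same data would occupy in memory as float64:

```python
# Rough in-memory footprint if every value were held as a float64.
n_rows, n_cols = 100_000, 100_000
total_bytes = n_rows * n_cols * 8        # 8 bytes per float64
print(f"{total_bytes / 1e9:.0f} GB")     # ~80 GB
```

So the raw values alone would fit within the machine's 250 GB of RAM.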
I previously managed to read the file on a 50-core Xeon machine with 250 GB of RAM, and I am now trying to write it out as a `.parq` directory like this:
```python
import dask.dataframe as dd

filename = 'huge.csv'
df = dd.read_csv(filename, delimiter='\t', sample=500000000)
df.to_parquet('huge.parq')
```
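For comparison, here is a minimal sketch of the same conversion with the dtype and partition size given explicitly, so dask can skip type inference on the 500 MB sample; the `float64` dtype and the 256 MB blocksize are assumptions of mine, not something I have actually run:

```python
import dask.dataframe as dd

df = dd.read_csv(
    'huge.csv',
    sep='\t',
    dtype='float64',     # assumption: every column holds plain floats
    blocksize='256MB',   # amount of CSV text per partition
)
df.to_parquet('huge.parq', engine='fastparquet')
```

(At 125 GB for 100,000+ rows, each row of text is on the order of a megabyte, so a 256 MB block only holds a couple of hundred rows.)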
It has been writing to `huge.parq` for close to a week, the directory has only reached 14 GB, and it does not look like the `.to_parquet` call is going to finish any time soon. `free -mh` shows there is still memory available, but the write to the `.parq` directory is tremendously slow:
```
$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           251G         98G         52G         10M        101G        152G
Swap:          238G          0B        238G
```
The questions are:

1. Given the size of the dataframe and the machine, is it feasible to save the dask dataframe to a parquet file at all?
2. Is it normal for `dask` and `fastparquet` to take so long to save huge dataframes?
3. Is there some way to estimate the time it will take to save a parquet file? (One monitoring idea is sketched below.)
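For the last question, the closest thing I can think of is dask's local `ProgressBar` diagnostic, which prints the fraction of tasks completed and the elapsed time, so the remaining time could be extrapolated crudely. This is only a sketch of how the call would be instrumented:

```python
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = dd.read_csv('huge.csv', delimiter='\t', sample=500000000)

# ProgressBar reports the percentage of dask tasks finished and the elapsed
# time, giving a rough basis for estimating how long the full write will take.
with ProgressBar():
    df.to_parquet('huge.parq')
```

(ProgressBar only works with the local, non-distributed schedulers, which is what dask.dataframe uses by default.)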