I am having issues using Dask. It is very slow compared to pandas, especially when reading large datasets of up to 40 GB. After some additional processing the dataset grows to about 100+ columns, which are mainly float64. This is quite slow, especially when I call compute like so:

    output = df[["date", "permno"]].compute(scheduler='threading')
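For context, here is a rough sketch of the kind of pipeline I mean; the read_csv source and the processing step are just stand-ins, not my exact code:

    import dask.dataframe as dd

    # Stand-in for the real load step (source format and path are placeholders)
    df = dd.read_csv("data/*.csv")  # ~40 GB of raw data

    # ... additional processing that grows the frame to 100+ float64 columns ...

    # Even though only two columns are requested, compute() still has to run
    # the whole task graph that produces them, on the threaded scheduler.
    output = df[["date", "permno"]].compute(scheduler="threading")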
I think I could live with the delay, even if it is frustrating. However, when I try to save the data to Parquet:

    df.to_parquet('my data frame', engine="fastparquet")

it runs out of memory on a server with about 110 GB of RAM. I notice that the buff/cache memory reported by free -h goes up from about 40 MB to 40+ GB.
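Just to be explicit about what I think this call does (as far as I understand, compute=True is the default for to_parquet, so the write runs the entire upstream task graph):

    # The failing write, repeated with a comment on its behaviour:
    # to_parquet computes the full DataFrame (compute=True by default),
    # so all of the processing above runs as part of this single call.
    df.to_parquet('my data frame', engine="fastparquet")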
I am confused about how this is possible, given that Dask is not supposed to load everything into memory. I use 100 partitions for the dataset in Dask.
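For reference on sizing, this is roughly how the partitioning works out; the repartition call is only a stand-in for however the 100 partitions end up being set:

    # ~40 GB of source data over 100 partitions is roughly 400 MB per
    # partition before the extra float64 columns are added, and more after,
    # so even a handful of partitions materialised at once adds up quickly.
    df = df.repartition(npartitions=100)
    print(df.npartitions)  # 100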