
I am having issues using Dask. It is very slow compared to pandas, especially when reading large datasets of up to 40 GB. After some additional processing the dataset grows to 100+ columns, which are mainly float64. This is quite slow, especially when I call compute like so: output = df[["date", "permno"]].compute(scheduler='threading').

I think I could live with the delay, even if it is frustrating. However, when I try to save the data to Parquet with df.to_parquet('my data frame', engine="fastparquet"), it runs out of memory on a server with about 110 GB of RAM. I notice that the buff/cache memory shown by free -h goes up from about 40 MB to 40+ GB.

I am confused as to how this is possible, given that Dask is not supposed to load everything into memory. I use 100 partitions for the dataset in Dask.
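
Roughly, the pipeline looks like this (the read step and paths below are placeholders; the real data is assembled from several sources):

    import dask.dataframe as dd

    # roughly 40 GB of input, split into 100 partitions
    df = dd.read_csv("data/*.csv")          # placeholder for the actual load step

    # ... groupby, sorting, and merging with other datasets happens here ...

    # this is already quite slow
    output = df[["date", "permno"]].compute(scheduler="threading")

    # this runs out of memory on a server with about 110 GB of RAM
    df.to_parquet("my data frame", engine="fastparquet")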

J.Ewa
  • Where is the data from, and how are you loading it? What processing are you applying to it? Have you tried the distributed scheduler or otherwise looked at any diagnostics? – mdurant Sep 21 '21 at 16:15
  • The data comes from the WRDS database; it is financial data. We apply a lot of groupby operations, do some sorting, and finally merge with other datasets (Compustat and CRSP). I tried the distributed scheduler (local machine); it runs out of memory at 4 GB. It won't work in the Google virtual machine we use. I looked at the status page at port 8787 but haven't been able to diagnose anything other than that it runs out of memory after 4 GB on my machine. However, I thought Dask by default uses the available cores to process data in parallel? – J.Ewa Sep 21 '21 at 17:36
  • @J.Ewa - that is so discouraging to read. I was counting on Dask to resolve the memory problem, but after reading this I have to rethink. Dask needs to minimize RAM usage when writing to a file; I'm not sure why this is not possible. – Nguai al Jul 01 '22 at 14:25
  • @Nguaial I hope you haven't given up. In this post, https://stackoverflow.com/questions/72440603/dask-dataframe-parallel-task/74236686#74236686, I listed the things that helped me solve the issues with Dask. – J.Ewa Oct 28 '22 at 14:18

1 Answer


Dask computations are executed lazily. The underlying operations aren't actually executed until the last possible moment. Here's what I can gather from your question / comment:

  • you read a 40GB dataset
  • you run grouping / sorting
  • you join with other datasets
  • you try to write to Parquet
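
To illustrate the laziness: each of those steps only adds tasks to a graph, and nothing is executed until the final output is requested. A minimal sketch (dataset paths and column names are made up):

    import dask.dataframe as dd

    df = dd.read_parquet("big_dataset/")        # lazy: only builds a task graph
    other = dd.read_parquet("other_dataset/")   # lazy

    grouped = df.groupby("permno").mean().reset_index()  # still lazy
    joined = grouped.merge(other, on="permno")           # still lazy

    # Only here are the read, groupby, and join actually executed, all at once.
    joined.to_parquet("out/", engine="fastparquet")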

The computation bottleneck isn't necessarily related to the Parquet writing part. Your bottleneck may be with the grouping, sorting, or joining.

You may need to perform a broadcast join, strategically persist intermediate results, or repartition; it's hard to say given the information provided.
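
As a rough illustration of those options (column names, paths, and sizes here are hypothetical, and whether each step helps depends on the actual data):

    import dask.dataframe as dd
    import pandas as pd

    df = dd.read_parquet("big_dataset/")              # the large table
    small = pd.read_parquet("small_lookup.parquet")   # a table that fits in RAM

    # Broadcast-style join: merging against an in-memory pandas DataFrame
    # avoids shuffling the large Dask DataFrame across partitions.
    joined = df.merge(small, on="permno", how="left")

    # Strategically persist an intermediate result that is reused later,
    # so it isn't recomputed for every downstream output.
    joined = joined.persist()

    # Repartition before writing so each output file stays a manageable size.
    joined.repartition(partition_size="100MB").to_parquet(
        "out/", engine="fastparquet"
    )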

Powers