
What are the fundamental differences and primary use-cases for Dask, Modin, and data.table?

I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations.

Ronak Shah
Shubham Samant

2 Answers


I'm trying to decide which of the three tools to learn for parallel / out-of-memory computing: dask, modin, or datatable (pandas is neither a parallel tool nor aimed at out-of-memory computing).

I didn't find any out-of-memory tools in the datatable documentation (discussed here), hence I'm only focusing on modin and dask.

In short, modin tries to be a drop-in replacement for the pandas API, while dask is lazily evaluated. modin is a column store, while dask partitions data frames by rows. The distribution engine behind dask is centralized, while modin's (called ray) is not. Edit: modin now supports dask as a calculation engine too.

dask came first, has a large ecosystem, and looks really well documented, discussed in forums, and demonstrated in videos. modin (on ray) makes some design choices that give it more flexibility in terms of resilience to hardware failures and high-performance serialization. ray aims to be most useful for AI research, but modin itself is general-purpose. ray also targets real-time applications, to better support real-time reinforcement learning.

More details here and here.

Mark Horvath

I have a task dealing with daily stock-trading data and came across this post. My data has about 60 million rows and fewer than 10 columns. I tested all three libraries on read_csv and a groupby mean. Based on this little test, my choice is dask. Below is a comparison of the three:

| library      | `read_csv` time | `groupby` time |
|--------------|-----------------|----------------|
| modin        | 175s            | 150s           |
| dask         | 0s (lazy load)  | 27s            |
| dask persist | 26s             | 1s             |
| datatable    | 8s              | 6s             |

It seems that modin is not as efficient as dask at the moment, at least for my data. dask persist tells dask that your data can fit into memory, so dask takes some time up front to load everything in instead of lazily. datatable holds all data in memory from the start and is super fast at both read_csv and groupby. However, given its incompatibility with pandas, it seems better to use dask. Actually, I came from R and was very familiar with R's data.table, so I have no problem applying its syntax in python. If datatable in python could be seamlessly connected to pandas (as it is with data.frame in R), it would have been my choice.
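A minimal harness in the spirit of the benchmark above, shown here with plain pandas on synthetic data so it runs anywhere; to reproduce the comparison you would swap in `dask.dataframe.read_csv`, `modin.pandas.read_csv`, or `datatable.fread` (the column names and row count are made up for illustration, and the timings are not the ones in the table):

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the trading data: 100k rows, 2 columns.
rng = np.random.default_rng(0)
n = 100_000
pd.DataFrame({"ticker": rng.integers(0, 100, n),
              "price": rng.random(n)}).to_csv("trades.csv", index=False)

t0 = time.perf_counter()
df = pd.read_csv("trades.csv")            # swap reader here per library
t_read = time.perf_counter() - t0

t0 = time.perf_counter()
means = df.groupby("ticker")["price"].mean()
t_group = time.perf_counter() - t0

print(f"read_csv: {t_read:.3f}s, groupby-mean: {t_group:.3f}s")
```

Note that for dask a fair comparison has to decide where to call `.compute()` or `.persist()`, since the read itself is lazy; that is exactly why the table has separate "dask" and "dask persist" rows.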

Albert Zhao
  • How many cores did you distribute to? Could it be that `modin` didn't help because it is a column store, while `dask` partitions by rows? – Mark Horvath Jun 08 '20 at 22:52
  • Did you also record the compute time of pandas itself as a baseline? Also surprised by the modin results. – Sylvain Jun 12 '20 at 11:11
  • It has been a while, but my memory is that I did not distribute cores, so I should have used the default settings. It was a little test, so I think I just recorded the wall time and did not dig deeper. – Albert Zhao Jun 13 '20 at 12:25
  • I think my final choice is to use the default pandas read_csv, although the loading is slow. I think I did not choose dask because, after many rounds of tweaking my code, getting errors, and so on, it was not as fast as I expected for other data manipulations. So I don't know whether these packages are improving, or whether there are other suggestions. Is vaex good? I haven't tried it, but someday I will certainly start another round of searching and testing... – Albert Zhao Jun 13 '20 at 12:31
  • Cylon provides a DataFrame API with fully distributed execution. It may be faster for this use case. github.com/cylondata/cylon. Disclaimer: I'm with the Cylon project. – Supun Kamburugamuva Apr 09 '21 at 15:16