What are the fundamental differences and primary use-cases for Dask | Modin | Data.table
I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations.
I'm trying to decide which of the three tools to learn for parallel / out-of-memory computing: `dask`, `modin`, or `datatable` (`pandas` is not a parallel tool, nor is it aimed at out-of-memory computing). I didn't find any out-of-memory tools in the `datatable` documentation (discussed here), hence I'm only focusing on `modin` and `dask`.
In short, `modin` is trying to be a drop-in replacement for the `pandas` API, while `dask` is lazily evaluated. `modin` is a column store, while `dask` partitions data frames by rows. The distribution engine behind `dask` is centralized, while that of `modin` (called `ray`) is not. Edit: `modin` now supports `dask` as a calculation engine too.
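To make the first distinction concrete, here is a minimal sketch of the two APIs; the file and column names (`trades.csv`, `symbol`, `price`) are assumptions for illustration only:

```python
# modin: a drop-in pandas replacement -- only the import changes,
# and operations execute eagerly, just like pandas.
import modin.pandas as mpd

df = mpd.read_csv("trades.csv")
means = df.groupby("symbol")["price"].mean()   # computed immediately

# dask: a pandas-like API that is lazily evaluated -- nothing runs
# until .compute() is called.
import dask.dataframe as dd

ddf = dd.read_csv("trades.csv")
lazy = ddf.groupby("symbol")["price"].mean()   # builds a task graph only
result = lazy.compute()                        # triggers the actual work
```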
`dask` came first, has a large ecosystem, and looks really well documented, widely discussed in forums, and demonstrated in videos. `modin` (on `ray`) has some design choices which make it more flexible in terms of resilience to hardware errors and high-performance serialization. `ray` aims at being most useful in AI research, but `modin` itself is of general use. `ray` also targets real-time applications, to better support real-time reinforcement learning.
I have a task of dealing with daily stock trading data and came across this post. My data is about 60 million rows and fewer than 10 columns. I tested all 3 libraries on `read_csv` and `groupby mean`. Based upon this little test, my choice is `dask`. Below is a comparison of the 3 (a sketch of the timing code follows the table):
| library | `read_csv` time | `groupby` time |
|--------------|-----------------|----------------|
| modin | 175s | 150s |
| dask | 0s (lazy load) | 27s |
| dask persist | 26s | 1s |
| datatable | 8s | 6s |
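The timings above weren't posted with code; a minimal sketch of how such a comparison might be run (the file and column names, `trades.csv` / `symbol` / `price`, are assumptions):

```python
import time

# --- dask: lazy read, then groupby mean ---
import dask.dataframe as dd

t0 = time.time()
ddf = dd.read_csv("trades.csv")                  # near-instant: builds a graph only
print("dask read_csv:", time.time() - t0)

t0 = time.time()
ddf.groupby("symbol")["price"].mean().compute()  # execution happens here
print("dask groupby:", time.time() - t0)

# --- dask persist: materialize the partitions in memory first ---
t0 = time.time()
ddf_mem = dd.read_csv("trades.csv").persist()    # pays the load cost up front
print("dask persist read:", time.time() - t0)

t0 = time.time()
ddf_mem.groupby("symbol")["price"].mean().compute()  # runs on in-memory data
print("dask persist groupby:", time.time() - t0)

# --- datatable: eager and fully in-memory ---
import datatable as dt
from datatable import f, by

t0 = time.time()
DT = dt.fread("trades.csv")
print("datatable fread:", time.time() - t0)

t0 = time.time()
DT[:, dt.mean(f.price), by(f.symbol)]
print("datatable groupby:", time.time() - t0)

# --- modin: eager, same calls as pandas ---
import modin.pandas as mpd

t0 = time.time()
mdf = mpd.read_csv("trades.csv")
print("modin read_csv:", time.time() - t0)

t0 = time.time()
mdf.groupby("symbol")["price"].mean()
print("modin groupby:", time.time() - t0)
```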
It seems that `modin` is not as efficient as `dask` at the moment, at least for my data. `dask persist` tells `dask` that your data can fit into memory, so it takes some time up front to load everything in instead of lazily. `datatable` holds all data in memory from the start and is super fast in both `read_csv` and `groupby`. However, given its incompatibility with `pandas`, it seems better to use `dask`. Actually I came from R and was very familiar with R's data.table, so I have no problem applying its syntax in Python. If `datatable` in Python could be seamlessly connected to `pandas` (the way data.table connects to data.frame in R), it would have been my choice.
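For what it's worth, `datatable` frames can be handed off to `pandas`, though via an explicit copy rather than seamlessly; a minimal sketch (same assumed file and column names as above):

```python
import datatable as dt
from datatable import f, by

# datatable's fast CSV reader, then an R-data.table-style groupby.
DT = dt.fread("trades.csv")
agg = DT[:, dt.mean(f.price), by(f.symbol)]

# Bridge to pandas when a pandas-only API is needed.
# This materializes a copy -- a hand-off, not a seamless link.
pandas_df = agg.to_pandas()
```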