3

Most of the benchmarks keep Dask and cuDF isolated, but I can use them together. Wouldn't Dask with cuDF be faster than Polars?!

Also, Polars only runs if the data fits in memory, whereas this isn't the case with Dask. So why does https://h2oai.github.io/db-benchmark/ show an out-of-memory indication for Dask?

SultanOrazbayev
zacko
    this seems like a question for the writer of that benchmark... maybe you could add that case to the script, or start an issue on github? SO isn't really the right forum for weighing the merits of an offsite benchmarking test - see [ask] – Michael Delgado Jun 15 '22 at 21:12
  • That's not a bad idea. – zacko Jun 16 '22 at 08:12
  • 3
    "honesty is hard": https://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks – mdurant Jun 16 '22 at 16:13
  • Do they even use the distributed backend (even on one machine)? That's the only one that can properly deal with not overusing memory (and it should always be used IMO) – creanion Jun 17 '22 at 08:55
  • Polars can use lazy processing, so memory isn't a limitation for a dataframe size. – misantroop Mar 16 '23 at 20:28
  • Read the benchmark manual first; the report was done in a single-node environment, and `dask` and `cuDF` were considered two separate libraries here. `dask` on a single node hardly outperforms `polars`, though it surely beats `pandas` quite often. `cuDF` on the benchmark was still very fast and in many cases faster than `polars`. And finally, `polars` now has a streaming mode, which can comfortably deal with data that doesn't fit in memory. `polars` was born to be, in some way at least, a much faster `pandas`, and in this sense `polars` has done a pretty good job. Remember `polars` has a `rust` backend. – stucash Mar 31 '23 at 15:11

1 Answer

3

Different dataframe libraries have their strengths and weaknesses. For example, see this blog post for a comparison of different libraries, especially from a scaling-pandas perspective.

Dask DataFrame comes with some default assumptions on how best to divide the workload among multiple tasks. If these assumptions are not valid for the particular use case, then it's not uncommon to see memory-related errors.

SultanOrazbayev