I'm trying to understand whether BlazingSQL is a competitor to Dask or complementary to it.
I have some medium-sized data (10-50GB) saved as parquet files on Azure blob storage.
IIUC, I can query, join, aggregate, and group by with BlazingSQL using SQL syntax, but I can also read the data into cuDF using dask_cudf
and do all the same operations using Python/dataframe syntax.
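To make the Dask side concrete, this is roughly what I have in mind. It's only a sketch: the function name, the path argument, and the customer_id / amount columns are made up for illustration, and I'm assuming that reading from Azure would also need an fsspec backend such as adlfs plus credentials passed via storage_options.

```python
def total_amount_per_customer(path, storage_options=None):
    """Group-by aggregation over parquet files with dask_cudf (needs a GPU).

    `path`, `customer_id` and `amount` are hypothetical placeholders.
    """
    # Imported lazily so this file can be loaded on a machine without RAPIDS.
    import dask_cudf

    # Builds a partitioned GPU dataframe lazily; no data is read at this point.
    ddf = dask_cudf.read_parquet(path, storage_options=storage_options)

    # Roughly what I would write in SQL as:
    #   SELECT customer_id, SUM(amount) FROM t GROUP BY customer_id
    totals = ddf.groupby("customer_id")["amount"].sum()

    # compute() executes the task graph and returns a single cudf Series.
    return totals.compute()
```

I'd call it with something like total_amount_per_customer("abfs://container/data/*.parquet", storage_options={...}), where the container name and the options are again placeholders.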
So, it seems to me that they're direct competitors?
Is it correct that one of the benefits of using Dask is that it operates on partitions, and so can handle datasets larger than GPU memory, whereas BlazingSQL is limited to what fits on the GPU?
Why would one choose BlazingSQL over Dask?
Edit:
The docs talk about dask_cudf,
but the actual repo is archived, with a note saying that Dask support is now in cudf
itself. It would be good to know how to leverage Dask
to operate on larger-than-GPU-memory datasets with cudf.