Questions tagged [cudf]

Use this tag for questions specifically related to the cuDF library or cuDF DataFrame manipulations.

From PyPI: The RAPIDS cuDF library is a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The RAPIDS GPU DataFrame provides a pandas-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.

146 questions
6 votes, 4 answers

How do I install cudf using pip?

I wanted to accelerate pandas on my GPU, so I decided to use the cuDF library. Please suggest other libraries (if any). I tried to install cuDF using pip with pip3.6 install cudf-cuda92. The pip version is 19.2.3 (latest). When I run pip3.6 install…
rahul_5409 • 71 • 1 • 1 • 5
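At the time many of these questions were asked, cuDF was not installable from plain PyPI. Current RAPIDS releases do publish pip wheels, but on NVIDIA's own package index; a sketch of that install path (package name and index URL are version- and CUDA-dependent, so check the RAPIDS install guide for your setup):

```shell
# cuDF wheels are served from NVIDIA's package index, not the default PyPI;
# the package name encodes the CUDA major version (cudf-cu12 for CUDA 12).
pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
```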
6 votes, 2 answers

How to do a matrix dot product on the GPU with rapids.ai

I'm using cuDF, part of the RAPIDS ML suite from Nvidia. Using this suite, how would I do a dot product? df = cudf.DataFrame([('a', list(range(20))), ('b', list(reversed(range(20)))), ('c', list(range(20)))]) e.g. how would I perform a dot…
Pablojim • 8,542 • 8 • 45 • 69
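cuDF mirrors the pandas API here: `.values` on a cudf.DataFrame returns a CuPy device array rather than a NumPy array, so a matrix-multiply expression written against pandas typically runs unchanged on the GPU. A CPU sketch with pandas/NumPy (swap in cudf on a GPU machine; data shortened to 3 rows for clarity):

```python
import pandas as pd

# CPU analogue of the cuDF question: on a cudf.DataFrame, .values returns a
# CuPy array instead of a NumPy array, so `m @ m.T` runs on the GPU there.
df = pd.DataFrame({
    "a": list(range(3)),
    "b": list(reversed(range(3))),
    "c": list(range(3)),
})

m = df.values      # (3, 3) ndarray; a CuPy ndarray under cuDF
dot = m @ m.T      # dot product of each row with every other row
print(dot.shape)   # (3, 3)
```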
4 votes, 0 answers

How to convert a cudf.core.dataframe.DataFrame into a pandas.DataFrame?

I have a cuDF dataframe: type(pred) > cudf.core.dataframe.DataFrame print(pred) > action 1778378 0 1778379 1 1778381 1 1778383 0 1778384 0 ... ... 2390444 0 2390446 0 2390478 0 2390481 …
Soerendip • 7,684 • 15 • 61 • 128
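For the conversion itself, cuDF provides `DataFrame.to_pandas()`, which copies the frame from GPU device memory to host memory. A minimal sketch (the `to_host` helper is hypothetical, and the pandas fallback lets the example run on a machine without a GPU):

```python
import pandas as pd

def to_host(df):
    """Return a pandas.DataFrame: calls .to_pandas() on cuDF objects,
    passes plain pandas objects through unchanged."""
    return df.to_pandas() if hasattr(df, "to_pandas") else df

# Stand-in for the `pred` frame in the question; with cudf installed this
# would be a cudf.DataFrame, and to_host(pred) would copy device -> host.
pred = pd.DataFrame({"action": [0, 1, 1, 0, 0]})
pdf = to_host(pred)
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>
```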
4 votes, 0 answers

GPU-based combinatoric resolver with table group-by operations

Given a table with many columns |-------|-------|-------|-------| | A | B | .. | N | |-------|-------|-------|-------| | 1 | 0 | .. | X | | 2 | 0 | .. | Y | | .. | .. | .. | .. …
Reacher234 • 230 • 2 • 11
4 votes, 2 answers

Recommended cuDF DataFrame Construction

I'm interested in recommended, fast ways of creating cuDF DataFrames from dense numpy objects. I have seen many examples of splitting out columns of a 2d numpy matrix into tuples, then calling cudf.DataFrame on a list of tuples -- this is rather…
quasiben • 1,444 • 1 • 11 • 19
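Recent cuDF versions accept a 2-D array in the constructor directly, matching pandas, so the column-by-column tuple splitting described in the question is typically no longer necessary. A CPU sketch with pandas (the equivalent cudf.DataFrame call is noted in a comment; verify it against your cuDF version):

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)  # dense 4x3 matrix

# pandas form; recent cuDF accepts the same call on the GPU:
#   gdf = cudf.DataFrame(arr, columns=["a", "b", "c"])
df = pd.DataFrame(arr, columns=["a", "b", "c"])
print(df.shape)  # (4, 3)
```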
3 votes, 1 answer

Replace integers with np.NaN in a cudf dataframe

I have a dataframe like this: df_a = cudf.DataFrame() df_a['key'] = [0, 1, 2, 3, 4] df_a['values'] = [1,2,np.nan,3,np.nan] and I would like to replace all 2s with np.nan. Usually in a pandas dataframe I would use df_a[df_a==2]=np.nan, but in cudf…
paka • 55 • 7
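Whole-frame boolean assignment (`df_a[df_a==2] = np.nan`) has historically not been supported in cuDF, but per-column `.replace` follows the pandas API. A pandas sketch of that route (the same `.replace` call is expected to work on a cudf.Series, hedged on your cuDF version):

```python
import numpy as np
import pandas as pd

# Same data as the question, with pandas standing in for cudf.
df_a = pd.DataFrame()
df_a["key"] = [0, 1, 2, 3, 4]
df_a["values"] = [1, 2, np.nan, 3, np.nan]

# Replace every 2 in the column with NaN; cuDF Series also implement .replace.
df_a["values"] = df_a["values"].replace(2, np.nan)
print(int(df_a["values"].isna().sum()))  # 3
```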
3 votes, 1 answer

Why is Polars called the fastest dataframe library? Isn't Dask with cuDF more powerful?

Most of the benchmarks have Dask and cuDF isolated, but I can use them together. Wouldn't Dask with cuDF be faster than Polars? Also, Polars only runs if the data fits in memory, but this isn't the case with Dask. So why is there…
zacko • 179 • 2 • 9
3 votes, 3 answers

Install cudf on Databricks

I am trying to use cuDF on Databricks. I started by following https://medium.com/rapids-ai/rapids-can-now-be-accessed-on-databricks-unified-analytics-platform-666e42284bd1, but the init script link is broken. Then, I followed this link…
Etienne Herlaut • 526 • 4 • 12
3 votes, 1 answer

Rolling linear regression for use with a groupby operation on a cuDF dataframe

I would like to calculate the rolling slope of y_value over x_value using cuML LinearRegression. Sample data (cuDF dataframe): | date | x_value | y_value | | ------ | ------ | ---- | | 2020-01-01 | 900 | 10 | | 2020-01-01 |…
nasiha • 31 • 1
3 votes, 4 answers

In-memory database optimized for reads (low/no writes) when operations involve sorting, aggregating, and filtering on any column

I am looking to load ~10GB of data into memory and perform SQL on it in the form of: sort on a single column (any column), aggregate on a single column (any column), filter on a single column (any column). What might be a good choice for performance?…
David542 • 104,438 • 178 • 489 • 842
3 votes, 1 answer

What is the relationship between BlazingSQL and Dask?

I'm trying to understand whether BlazingSQL is a competitor or complementary to Dask. I have some medium-sized data (10-50GB) saved as parquet files on Azure blob storage. IIUC I can query, join, aggregate, and group by with BlazingSQL using SQL syntax, but I…
Dave Hirschfeld • 768 • 2 • 6 • 15
3 votes, 1 answer

How do you determine memory stats while using rapids.ai?

I'm using the Python libraries of rapids.ai, and one of the key things I'm starting to wonder is: how do I inspect memory allocation programmatically? I know I can use nvidia-smi to look at some overall high-level stats, but specifically I would like to…
Robert • 1,220 • 16 • 19
3 votes, 1 answer

How to read a single large parquet file into multiple partitions using dask/dask-cudf?

I am trying to read a single large parquet file (size > gpu_size) using dask_cudf/dask, but it is currently read into a single partition, which I am guessing is the expected behavior, inferring from the doc-string:…
Vibhu Jawa • 88 • 9
3 votes, 1 answer

Running RAPIDS without a GPU for development?

Is there a way to run RAPIDS without a GPU? I usually develop on a small local machine without a GPU, then push my code to a powerful remote server for real use. Things like TensorFlow allow switching between CPU and GPU depending on whether they're…
golmschenk • 11,736 • 20 • 78 • 137
2 votes, 0 answers

How to convert a dask_cudf column to datetime?

How can we convert a dask_cudf column of strings or nanoseconds to a datetime object? to_datetime is available in pandas and cudf. See sample data below: import pandas import cudf # with pandas df = pandas.DataFrame( {'city' :…
dleal • 2,244 • 6 • 27 • 49
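`cudf.to_datetime` mirrors `pandas.to_datetime`, and for a dask_cudf column the usual pattern is to apply the conversion per partition via `map_partitions` (an assumption based on dask/cuDF API parity; verify against your versions). A CPU sketch of the underlying conversion with pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "SF"],
    "ts": ["2020-01-01 00:00:00", "2020-06-15 12:30:00"],
})

# pandas form; cudf.to_datetime takes the same call. For a dask_cudf column,
# a hedged per-partition sketch would be:
#   ddf["ts"] = ddf["ts"].map_partitions(cudf.to_datetime)
df["ts"] = pd.to_datetime(df["ts"])
print(df["ts"].dt.year.tolist())  # [2020, 2020]
```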