
How to create a unique id column in a dask_cudf dataframe across all partitions

So far I am using the following technique, but if I increase the data to more than 10 crore (100 million) rows it gives me a memory error.

import cupy

def unique_id(df):
    # Assign a 0..len(df)-1 range within this (single) partition
    df['unique_id'] = cupy.arange(len(df))
    return df

# Collapse to one partition so the ids are sequential across the whole frame
part = data.npartitions
data = data.repartition(npartitions=1)
cols_meta = {c: str(data[c].dtype) for c in data.columns}
data = data.map_partitions(unique_id, meta={**cols_meta, 'unique_id': 'int64'})
data = data.repartition(npartitions=part)

If there is any other way, or any modification to this code, please suggest it. Thank you for the help.

A14

2 Answers


"I was doing that because I wanted to create the IDs sequentially, up to the length of the data."

The other suggestions will likely work. However, one of the easiest ways to do this is to create a temporary column with value 1 and use cumsum, like the following:

import cudf
import dask_cudf

df = cudf.DataFrame({
    "a": ["dog"] * 10
})
ddf = dask_cudf.from_cudf(df, npartitions=3)

# Temporary column of ones; its cumulative sum is a globally sequential id
ddf["temp"] = 1
ddf["monotonic_id"] = ddf["temp"].cumsum()
del ddf["temp"]
print(ddf.partitions[2].compute())
     a  monotonic_id
8  dog             9
9  dog            10

As expected, the two rows in partition index 2 have IDs 9 and 10. If you need the IDs to start at 0, you can subtract 1.
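For instance, a one-line sketch of that shift, applied to the same monotonic_id column from the example above:

# Shift the 1-based cumsum IDs down so they start at 0
ddf["monotonic_id"] = ddf["monotonic_id"] - 1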

Nick Becker

The reason you are running into a memory error is this step:

data = data.repartition(npartitions=1)

By having a single partition you are forcing all of the data onto a single worker, which will cause memory problems as the dataset grows. What you want to do instead is assign a unique identifier while keeping the existing partitions; see this answer. A rough sketch of that idea is shown below.
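Since the linked answer is not reproduced here, the following is only a minimal illustration of one way to get globally sequential IDs without collapsing to one partition (not necessarily the exact approach in the link). It assumes a dask_cudf dataframe named data and a dask version recent enough to pass partition_info to map_partitions; the helper name add_global_id and the unique_id column are just illustrative.

import cupy
import dask

# Length of each partition, computed without moving all data to one worker
sizes = dask.compute(*[dask.delayed(len)(p) for p in data.to_delayed()])

# Starting id of each partition = number of rows in all earlier partitions
offsets = [0]
for n in sizes[:-1]:
    offsets.append(offsets[-1] + n)

def add_global_id(df, partition_info=None):
    # partition_info["number"] identifies the current partition; fall back
    # to 0 during dask's meta inference, where it may not be set.
    start = 0
    if partition_info is not None and partition_info.get("number") is not None:
        start = offsets[partition_info["number"]]
    df["unique_id"] = cupy.arange(start, start + len(df))
    return df

data = data.map_partitions(add_global_id)

Note that computing the partition lengths up front triggers an extra pass over the data, but each partition stays on its own worker the whole time.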

SultanOrazbayev
  • I was doing that because I wanted to create the IDs sequentially, up to the length of the data. If I use the same function with map_partitions, every partition starts from 0, so I get duplicates. I tried that solution, but it creates random IDs. Is there any option to create sequential IDs up to the length of the dataframe? – A14 May 19 '21 at 09:08
  • The link I provided creates sequential IDs. – SultanOrazbayev May 19 '21 at 09:09
  • Thank you. It's working perfectly with a recent dask version, but with dask_cudf version 0.18 it is not working as expected. – A14 May 19 '21 at 09:43
  • If I have a duplicate index then I will not get unique IDs, right? – A14 May 19 '21 at 10:23