List operation with CUDF dataframe

Question

I have a Cudf dataframe which looks like this

The dtype of columns POSITION_ANTENNA1 and POSITION_ANTENNA2 are lists, and I want to construct a column = POSITION_ANTENNA1 - POSITION_ANTENNA2. However, it is giving me an error

Lists concatenation for this operation is not yetsupported

However, if I am converting the dataframe to Pandas it is working fine. Is there a way to do the simple list operation without converting it to pandas.

Edit:

Here is the operation I am trying to do

df_merged['BASELINE'] = df_merged.POSITION_ANTENNA1-df_merged.POSITION_ANTENNA2

And I am getting this error

However, if I am doing the following it is working fine

df_merged['BASELINE'] = df_merged.POSITION_ANTENNA1.to_pandas()-df_merged.POSITION_ANTENNA2.to_pandas()

"if I am converting the dataframe to Pandas it is working fine" - please show the code that works in pandas, the expected output, and how you are trying to call it with dask. — mdurant, May 25 '22 at 15:00
You should have scalars as values in your dataframe if you want to perform arithmetic like this. Dataframes were not intended to be efficient or convenient with sequences as values. — Paul H, May 25 '22 at 15:15
But how it is working in Pandas and not in cudf. The problem is the original data has N-D array as cell values and I want to keep the shape as it is. — Arpan Das, May 25 '22 at 15:17
Is it the cuDF nature of the dataframe that is the problem, or dask? — mdurant, May 25 '22 at 15:18
for me, this does not work in pandas. `df = pd.DataFrame({'pt1': [[35.2, -110.0], [47.3, -68.2]], 'pt2': [[34.8, -109.8], [46.8, -70.1]]}); df.pt2 - df.pt1` raises a similar error. I'd strongly recommend following Paul H's advice - pandas and dask are designed to work with columns of uniform numpy-compatible data types like float, int, string, not object types like lists. While you *can* hold objects in dataframes, math operations like this won't work as intended (note that `+` doesn't error, but it just concatenates the lists). If this is working for you in pandas, can you show us a [mre]? — Michael Delgado, May 25 '22 at 15:45
oh - just saw your comment that the cells are ndarrays. I mean you *can* do what SultanOrazbayev suggests below. But it would be a favor to your colleagues if you change the format so the dataframe performs better and is easier to work with. Otherwise you will always have to hack together workarounds like this for every operation. — Michael Delgado, May 25 '22 at 15:56
I wish I could change the dataframe but it is not upto me. It is a standard dataframe coming from telescopes and the whole community follow this format. — Arpan Das, May 25 '22 at 17:50

SultanOrazbayev · Answer 1 · 2022-05-25T18:12:12.840

This question is difficult to solve reliably without access to sample data, but the code snippet below should be a good starting point for adjusting to actual use-case.

As general advice, I'd recommend to first solve a smaller case using pandas (since both dask and cudf provide the ability to operate on pandas dataframes):

from pandas import DataFrame, concat

df = DataFrame({"a": [[1, 2], [3, 4]], "b": [[5, 7], [9, 11]]})


def calculate_difference(df):
    # create dfs using https://stackoverflow.com/a/35491399/10693596

    _a = DataFrame(df["a"].tolist(), columns=["0", "1"], index=df.index)
    _b = DataFrame(df["b"].tolist(), columns=["0", "1"], index=df.index)
    _diff = _a - _b
    return concat([df, _diff], axis=1)


print(calculate_difference(df))
#         a        b  0  1
# 0  [1, 2]   [5, 7] -4 -5
# 1  [3, 4]  [9, 11] -6 -7

In the function, we rely on this answer to first convert the data into columns with consistent indexing, and then find the difference in column values.

Assuming the above generates the desired result, we can map the function across dataframe chunks (since operations are done row-wise, there is no need for data exchange across partitions):

from dask.dataframe import from_pandas

# will use the pandas example to provide meta (highly recommended)
meta = calculate_difference(df)

ddf = from_pandas(df, npartitions=1)
ddf = ddf.map_partitions(calculate_difference, meta=meta)

print(ddf.compute())
#         a        b  0  1
# 0  [1, 2]   [5, 7] -4 -5
# 1  [3, 4]  [9, 11] -6 -7

For dask cudf, you could convert the dask cudf into dask dataframe:

from dask_cudf import from_cudf

# assuming df is a cudf dataframe
ddf = from_cudf(df, npartitions=2)

# will use the pandas example to provide meta (highly recommended)
meta = calculate_difference(df.head(3))
ddf = ddf.map_partitions(calculate_difference, meta=meta)

Thank you for the answer but the problem is I don't want to convert it to Pandas, I want to do all the operations in GPU. For panda the operation is much simpler like I mentioned in the original post. But I am looking for a solution to do the whole operation in CUDF or DASK_CUDF as there will be a batch parallel processing for huge amount of data and the main concern is the speed as this kind of operation is already done using X-array and Dask before. — Arpan Das, May 25 '22 at 18:04

score 1 · Answer 2 · answered Jul 01 '22 at 23:59

SultanOrazbayev is right (+1ed): you cannot do what you want with the way you're formatting your data in the dataframe. Personally, I'd explode out POSITION_ANTENNA1 and POSITION_ANTENNA2 into two separate dataframes, do my subtraction operation on the two separate dataframes, then bring the result into the cudf dataframe you want and delete the two antenna dataframes for space.

Please make a feature request in cuDF so we can track and prioritize this use.

List operation with CUDF dataframe

2 Answers2