
Here is the code:

import dask.dataframe as dd
import numpy as np
import pandas as pd


df = pd.DataFrame({
    'col1': ['A', 'A', 'E', np.nan, 'D', 'C', 'B', 'C'],
    'col2': [2, 1, 9, 8, 7, 4, 10, 5],
    'col3': [0, 1, 9, 4, 2, 3, 1, 2],
    'col4': [11, 12, 12, 13, 14, 55, 56, 22],
})

out_1 = df.loc[::-1, "col4"]

dd_df = dd.from_pandas(df, npartitions=5)
out_2 = dd_df.loc[::-1, "col4"]

# out_2 throws an error

I know that Dask doesn't work the same way as pandas. How can I get the same output as out_1 with Dask?

  • related: https://stackoverflow.com/questions/72736071/is-there-a-way-to-traverse-through-a-dask-dataframe-backwards/72737000#72737000 – Michael Delgado Sep 12 '22 at 00:14

1 Answer


You can reverse the order of the partitions and also schedule a job to reverse the order of rows within each partition (here dd_df is the Dask DataFrame from the question), like so:

In [30]: rev_df = dd.concat(
    ...:     [dd_df.partitions[i] for i in range(dd_df.npartitions - 1, -1, -1)]
    ...: ).map_partitions(lambda x: x[::-1], meta=dd_df)
    ...:

In [31]: rev_df.compute()
Out[31]:
  col1  col2  col3  col4
7    C     5     2    22
6    B    10     1    56
5    C     4     3    55
4    D     7     2    14
3  NaN     8     4    13
2    E     9     9    12
1    A     1     1    12
0    A     2     0    11

This would work the same way for a column or series:

rev_col1 = dd.concat(
    [
        dd_df["col1"].partitions[i]
        for i in range(dd_df.npartitions - 1, -1, -1)
    ]
).map_partitions(lambda x: x[::-1], meta=dd_df["col1"])
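
Applied to the question's own names, the same pattern reproduces out_1 for "col4" (a quick sketch; the assert is illustrative, not part of the original answer):

# Same technique as above, applied to the "col4" series from the question:
# reverse the partition order, then reverse rows within each partition.
rev_col4 = dd.concat(
    [dd_df["col4"].partitions[i] for i in range(dd_df.npartitions - 1, -1, -1)]
).map_partitions(lambda x: x[::-1], meta=dd_df["col4"])

# The computed result should match the pandas output out_1.
assert rev_col4.compute().equals(out_1)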
• Thanks! I am not sure it is an optimized answer, but it works the way I want. – Coder Sep 15 '22 at 15:40
• I don't see any reason why this should be slow: reversing the order of the partitions is very fast, and mapping the order-reversal job to workers uses your cluster to schedule and lazily complete the row-order reversal. It's not pretty, but I don't think it's bad in terms of performance. Open to feedback, though, if you see a bottleneck or something. – Michael Delgado Sep 15 '22 at 16:09
• It works perfectly with the data I have. Could you please tell me what `.map_partitions(lambda x: x[::-1], meta=create_label_dask["case:RequestedAmount"])` does? I wonder how the `x: x[::-1]` operation works here. – Coder Sep 15 '22 at 16:22
• [`map_partitions`](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) is an important one to know for dask.dataframe. Dask DataFrames are just a dictionary of pandas DataFrames + indexing/scheduling logic + a connection to the Dask scheduler. That's oversimplifying, of course, but that's the gist. map_partitions runs a function on each pd.DataFrame partition, so call `df[::-1]` in pandas and you'll get what this does (see the sketch after this thread). You need to provide meta to map_partitions so Dask knows the types to expect back from the function. – Michael Delgado Sep 15 '22 at 16:33
• Thank you, this is good information. – Coder Sep 15 '22 at 16:59
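
A minimal, self-contained sketch of the map_partitions behavior described in the comment above (the small frame here is made up purely for illustration):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(6)})
ddf = dd.from_pandas(pdf, npartitions=2)

# map_partitions applies a plain-pandas function to each pd.DataFrame
# partition independently; `x[::-1]` is ordinary pandas row reversal.
# meta tells Dask the columns/dtypes to expect back, so the task graph
# can be built without computing anything.
reversed_within = ddf.map_partitions(lambda x: x[::-1], meta=pdf)

print(reversed_within.compute())  # rows reversed within each partition only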