 a   b 
'1'  1
'2'  2
'3'  3
'4'  4

I would like to insert a new column containing the values of column b in reverse order, while keeping the other columns unchanged. Example:

 a   b  c
'1'  1  4
'2'  2  3
'3'  3  2 
'4'  4  1

We use temp['b'][::-1] to achieve this result in pandas. Is this transformation possible in PySpark as well?
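
A minimal sketch of the pandas version for reference (assuming temp is the dataframe above; .to_numpy() is used so the reversed values are assigned by position instead of being realigned by index):

import pandas as pd

temp = pd.DataFrame({"a": ["1", "2", "3", "4"], "b": [1, 2, 3, 4]})

# assign b reversed by position; .to_numpy() drops the index so the
# values are not realigned back to their original rows
temp["c"] = temp["b"].to_numpy()[::-1]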

mozway
  • Do you know the range of column 'b'? Is it consecutive? – Nir Hedvat Nov 29 '21 at 07:57
  • How do you define ordering in the dataframe? There's no concept of an index in a Spark dataframe since the data is possibly distributed, so you need to define an ordering criterion with orderBy. – rchome Nov 29 '21 at 07:57
  • @NirHedvat 'b' column isn't consecutive. It can contain any numeric value. –  Nov 29 '21 at 07:59
  • @rchome The dataframe is already ordered by the column 'a'. –  Nov 29 '21 at 08:05

1 Answer


Let's say your dataframe is ordered by column a.

You could try performing a self-join on a generated column that reverses the order. Such a column, rn, can be generated with row_number, e.g.

Using the PySpark API

from pyspark.sql import functions as F
from pyspark.sql import Window

output_df = (
    # number rows in ascending order of "a"
    df.withColumn(
        "rn",
        F.row_number().over(Window.orderBy("a"))
    )
    .alias("df1")
    # join to a second copy numbered in descending order of "a",
    # so row n in df1 is paired with row n-from-the-end in df2
    .join(
        df.withColumn(
            "rn",
            F.row_number().over(Window.orderBy(F.col("a").desc()))
        ).alias("df2"),
        ["rn"],
        "inner"
    )
    # keep the original columns and expose the reversed b as c
    .selectExpr("df1.a", "df1.b", "df2.b as c")
)
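
A quick way to try this out (a minimal sketch, assuming a local SparkSession named spark and the sample data from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [("1", 1), ("2", 2), ("3", 3), ("4", 4)],
    ["a", "b"],
)

# after building output_df with the snippet above:
# output_df.orderBy("a").show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |  1|  1|  4|
# |  2|  2|  3|
# |  3|  3|  2|
# |  4|  4|  1|
# +---+---+---+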

Using Spark SQL

    -- number rows ascending and descending by a, then join on that
    -- number so each row is paired with its mirror-image b value
    select
        df1.a,
        df1.b,
        df2.b as c
    from (
        select
            *,
            row_number() over (order by a) rn
        from
            df
    ) df1
    inner join (
        select
            b,
            row_number() over (order by a desc) rn
        from
            df
    ) df2 on df1.rn = df2.rn;

a  b  c
1  1  4
2  2  3
3  3  2
4  4  1

View on DB Fiddle

ggordon
  • Don't window functions without partitioning result in all data being collected to a single node? – Hristo Iliev Nov 29 '21 at 08:27
  • @HristoIliev There is a good response to the question in your comment [here](https://stackoverflow.com/questions/41313488/avoid-performance-impact-of-a-single-partition-mode-in-spark-window-functions) – ggordon Nov 29 '21 at 15:46
  • So `F.row_number().over(Window.orderBy("a"))` may blow up the memory of some executor in the general case, if `df` is large enough? Or is there some distributed magic I'm missing here? – Hristo Iliev Nov 30 '21 at 09:02
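
For completeness, one rough sketch of a way to avoid that single-partition window is to derive the positional index with zipWithIndex on the sorted RDD instead of row_number (still a full sort, but no window over a single partition; df is assumed to be the same dataframe as in the answer):

from pyspark.sql import functions as F

# position of each row after a global ascending sort on "a"
asc = (
    df.orderBy("a")
    .rdd.zipWithIndex()
    .map(lambda pair: pair[0] + (pair[1],))
    .toDF(df.columns + ["rn"])
)

# same positions, but walking "a" in descending order; keep only b
desc = (
    df.orderBy(F.col("a").desc())
    .rdd.zipWithIndex()
    .map(lambda pair: (pair[0]["b"], pair[1]))
    .toDF(["c", "rn"])
)

output_df = asc.join(desc, "rn").select("a", "b", "c")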