 a   b 
'1'  1
'2'  2
'3'  3
'4'  4

I would like to insert a new column containing the values of column b in reverse order, while keeping the other columns unchanged. Example:

 a   b  c
'1'  1  4
'2'  2  3
'3'  3  2 
'4'  4  1

We use temp['b'][::-1] to achieve this result in pandas. Is this transformation possible in PySpark as well?
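
A minimal sketch of the pandas version for reference (assuming temp is the dataframe above; .to_numpy() is used so the reversed values are assigned by position instead of being realigned by index):

import pandas as pd

temp = pd.DataFrame({"a": ["1", "2", "3", "4"], "b": [1, 2, 3, 4]})

# assign b reversed by position; .to_numpy() drops the index so the
# values are not realigned back to their original rows
temp["c"] = temp["b"].to_numpy()[::-1]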

mozway
  • Do you know the range of column 'b'? Is it consecutive? – Nir Hedvat Nov 29 '21 at 07:57
  • How do you define ordering in the dataframe? There's no concept of an index in a Spark dataframe since the data is possibly distributed, so you need to define an ordering criterion with orderBy. – rchome Nov 29 '21 at 07:57
  • @NirHedvat 'b' column isn't consecutive. It can contain any numeric value. –  Nov 29 '21 at 07:59
  • @rchome The dataframe is already ordered by the column 'a'. –  Nov 29 '21 at 08:05

1 Answer


Let's say your dataframe is ordered by column a.

You could try performing a self-join on a generated column that reverses the order. Such a column, rn, can be generated with row_number, e.g.

Using the PySpark API

from pyspark.sql import functions as F
from pyspark.sql import Window

output_df = (
    # number rows in ascending order of "a"
    df.withColumn(
        "rn",
        F.row_number().over(Window.orderBy("a"))
    )
    .alias("df1")
    # join to a second copy numbered in descending order of "a",
    # so row n in df1 is paired with row n-from-the-end in df2
    .join(
        df.withColumn(
            "rn",
            F.row_number().over(Window.orderBy(F.col("a").desc()))
        ).alias("df2"),
        ["rn"],
        "inner"
    )
    # keep the original columns and expose the reversed b as c
    .selectExpr("df1.a", "df1.b", "df2.b as c")
)
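
A quick way to try this out (a minimal sketch, assuming a local SparkSession named spark and the sample data from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [("1", 1), ("2", 2), ("3", 3), ("4", 4)],
    ["a", "b"],
)

# after building output_df with the snippet above:
# output_df.orderBy("a").show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |  1|  1|  4|
# |  2|  2|  3|
# |  3|  3|  2|
# |  4|  4|  1|
# +---+---+---+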

Using Spark SQL

    -- number rows ascending and descending by a, then join on that
    -- number so each row is paired with its mirror-image b value
    select
        df1.a,
        df1.b,
        df2.b as c
    from (
        select
            *,
            row_number() over (order by a) rn
        from
            df
    ) df1
    inner join (
        select
            b,
            row_number() over (order by a desc) rn
        from
            df
    ) df2 on df1.rn = df2.rn;

a  b  c
1  1  4
2  2  3
3  3  2
4  4  1

View on DB Fiddle

ggordon
  • Don't window functions without partitioning result in all data being collected to a single node? – Hristo Iliev Nov 29 '21 at 08:27
  • @HristoIliev There is a good response to the question in your comment [here](https://stackoverflow.com/questions/41313488/avoid-performance-impact-of-a-single-partition-mode-in-spark-window-functions) – ggordon Nov 29 '21 at 15:46
  • So `F.row_number().over(Window.orderBy("a"))` may blow up the memory of some executor in the general case, if `df` is large enough? Or is there some distributed magic I'm missing here? – Hristo Iliev Nov 30 '21 at 09:02
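
For completeness, one rough sketch of a way to avoid that single-partition window is to derive the positional index with zipWithIndex on the sorted RDD instead of row_number (still a full sort, but no window over a single partition; df is assumed to be the same dataframe as in the answer):

from pyspark.sql import functions as F

# position of each row after a global ascending sort on "a"
asc = (
    df.orderBy("a")
    .rdd.zipWithIndex()
    .map(lambda pair: pair[0] + (pair[1],))
    .toDF(df.columns + ["rn"])
)

# same positions, but walking "a" in descending order; keep only b
desc = (
    df.orderBy(F.col("a").desc())
    .rdd.zipWithIndex()
    .map(lambda pair: (pair[0]["b"], pair[1]))
    .toDF(["c", "rn"])
)

output_df = asc.join(desc, "rn").select("a", "b", "c")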