
I have a DataFrame with columns time, a, b, c, d, val. I would like to create a DataFrame with an additional column that contains the row number of each row within its group, where a, b, c, d form the group key.

I tried Spark SQL by defining a window function; in SQL it looks like this:

select time, a, b, c, d, val,
       row_number() over (partition by a, b, c, d order by time) as rn
from table
group by a, b, c, d, val

I would like to do this on the DataFrame itself, without using Spark SQL.

Thanks

matlabit

1 Answer


I don't know the Python API too well, but I will give it a try. You can try something like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number rows within each (a, b, c, d) group, ordered by time
df.withColumn("row_number",
              F.row_number().over(Window.partitionBy("a", "b", "c", "d").orderBy("time"))).show()
Carlos Vilchez
  • Yes, that's the same as what I did; you're missing the partition part: df = df.withColumn("id", F.rowNumber().over(Window.partitionBy("a","b","c","d").orderBy(col("time")))). But I would like to do it without that. Thanks – matlabit May 23 '16 at 14:30
  • Note that Spark <= 1.6 uses a different function name, rowNumber() (see the sketch after these comments) – y.selivonchyk Jun 07 '17 at 20:26
  • Thanks @laguittemh – Carlos Vilchez Sep 07 '17 at 13:18
  • @CarlosVilchez is it necessary to use the `orderby` part? Can we just add the `row_number` preserving the natural ordering, without ordering it? – PolarBear10 Jul 30 '18 at 10:27
  • @Matthew you may need to create a new question for that. There may be some complexities I don't see right off the top of my head, but you need orderby and probably a new column with the row_number to use it. – Carlos Vilchez Jul 30 '18 at 10:56
  • Could you create a new question in SO for your case? It would give more visibility to your problem. – Carlos Vilchez Jul 30 '18 at 13:30
  • Works perfectly. I just had to specify descending order. `df.withColumn("rank", F.row_number().over(Window.partitionBy(df["a"]).orderBy(df['time'].desc())))` – Danstan Sep 22 '20 at 14:59
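For reference, a minimal sketch of the legacy function name mentioned in the comments above, assuming Spark <= 1.6 and the same df, F, and Window imports as in the answer:

# rowNumber() is the pre-2.0 name; it was deprecated in favor of row_number()
df.withColumn("row_number",
              F.rowNumber().over(Window.partitionBy("a", "b", "c", "d").orderBy("time")))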