
I have a DataFrame with columns time, a, b, c, d, val. I would like to create a DataFrame with an additional column that contains the row number of each row within its group, where a, b, c, d form the group key.

I tried Spark SQL by defining a window function; in SQL it looks like this:

select time, a, b, c, d, val,
       row_number() over (partition by a, b, c, d order by time) as rn
from table
group by a, b, c, d, val

I would like to do this on the DataFrame itself, without using Spark SQL.

Thanks

matlabit

1 Answer


I don't know the Python API too well, but I will give it a try. You can try something like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number rows within each (a, b, c, d) group, ordered by time
df.withColumn("row_number",
              F.row_number().over(Window.partitionBy("a", "b", "c", "d").orderBy("time"))).show()
Carlos Vilchez
  • Yes, that's the same as what I did; you're missing the partition part: df = df.withColumn("id", F.rowNumber().over(Window.partitionBy("a","b","c","d").orderBy(col("time")))). But I would like to do it without that. Thanks – matlabit May 23 '16 at 14:30
  • Note that Spark <= 1.6 uses a different function name, rowNumber() (see the sketch after these comments) – y.selivonchyk Jun 07 '17 at 20:26
  • Thanks @laguittemh – Carlos Vilchez Sep 07 '17 at 13:18
  • @CarlosVilchez is it necessary to use the `orderby` part? Can we just add the `row_number` preserving the natural ordering, without ordering it? – PolarBear10 Jul 30 '18 at 10:27
  • @Matthew you may need to create a new question for that. There may be some complexities I don't see right off the top of my head, but you need orderby and probably a new column with the row_number to use it. – Carlos Vilchez Jul 30 '18 at 10:56
  • Could you create a new question in SO for your case? It would give more visibility to your problem. – Carlos Vilchez Jul 30 '18 at 13:30
  • Works perfectly. I just had to specify descending order. `df.withColumn("rank", F.row_number().over(Window.partitionBy(df["a"]).orderBy(df['time'].desc())))` – Danstan Sep 22 '20 at 14:59
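For reference, a minimal sketch of the legacy function name mentioned in the comments above, assuming Spark <= 1.6 and the same df, F, and Window imports as in the answer:

# rowNumber() is the pre-2.0 name; it was deprecated in favor of row_number()
df.withColumn("row_number",
              F.rowNumber().over(Window.partitionBy("a", "b", "c", "d").orderBy("time")))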