
Let's say I have a table A:

date          product     rating
07/02/2018    A           good
07/02/2017    B           good
07/02/2016    A           bad
07/02/2015    C           medium
07/02/2016    C           bad

In a first phase, I wish to obtain the following table:

product  date          score
A        07/02/2018     1
A        07/02/2016     3
B        07/02/2017     1
C        07/02/2016     3
C        07/02/2015     2

The second phase:

product  date          score for oldest date
A        07/02/2016     3
B        07/02/2017     1
C        07/02/2015     2

What is the shortest way to do this (ranking, changing scores from strings to numbers, aggregating)? Can I do all these steps in one shot?

Thank you

Lizou

2 Answers


You first need to map the rating column to integer values. You can use a dict mapping, e.g.:

diz = {k: str(v) for k, v in rating.items()}

See this question: replace values of one column in a spark df by dictionary key-values (pyspark)

Then use sorting/aggregation to get the desired output.
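To make the mapping idea concrete, here is a minimal plain-Python sketch; the score values (good=2, medium=1, bad=0) and the sample rows are assumptions taken from this page, not a fixed convention:

```python
# Map rating strings to integer scores with a plain dict lookup.
# The score values (good=2, medium=1, bad=0) are an assumed convention.
rating_to_score = {'good': 2, 'medium': 1, 'bad': 0}

rows = [
    ('07/02/2018', 'A', 'good'),
    ('07/02/2017', 'B', 'good'),
    ('07/02/2016', 'A', 'bad'),
    ('07/02/2015', 'C', 'medium'),
    ('07/02/2016', 'C', 'bad'),
]

# Replace the rating string by its integer score in each row.
scored = [(date, product, rating_to_score[r]) for date, product, r in rows]
print(scored[0])  # ('07/02/2018', 'A', 2)
```

In Spark you would express the same replacement with `na.replace` on the dataframe, as the linked question shows.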

Abhishek Choudhary

Just to make Abhishek's solution more concrete, suppose you have the dataframe given above.

First phase

import pyspark.sql.functions as fn
from pyspark.sql.types import * 

# replace rating to score
df2 = df.na.replace({'good': '2', 'bad': '0', 'medium': '1'}, subset='rating')
# cast rating score to integer
df2 = df2.select('date', 'product', fn.col('rating').cast(IntegerType()).alias('score'))
# sorting by product and score
df2.sort(fn.col('product'), fn.desc('score')).show()

You can also just use fn.col('score') if you want to sort ascending instead.
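For intuition, the same sort (product ascending, score descending) can be sketched in plain Python; the rows and the assumed score mapping are taken from the examples above:

```python
# Assumed score mapping and sample rows from the question above.
rating_to_score = {'good': 2, 'medium': 1, 'bad': 0}
rows = [
    ('07/02/2018', 'A', 'good'),
    ('07/02/2017', 'B', 'good'),
    ('07/02/2016', 'A', 'bad'),
    ('07/02/2015', 'C', 'medium'),
    ('07/02/2016', 'C', 'bad'),
]
scored = [(d, p, rating_to_score[r]) for d, p, r in rows]

# Sort by product ascending, then score descending (negate the score).
scored.sort(key=lambda row: (row[1], -row[2]))
for d, p, s in scored:
    print(p, d, s)
```

This prints one line per row, products grouped together with the highest score first, mirroring what the Spark `sort` call returns.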

Second phase

Pretty much the same as the first one. However, you have to apply unix_timestamp to parse the date, then sort by product and parsed date.

df2 = df.na.replace({'good': '2', 'bad': '0', 'medium': '1'}, subset='rating')
df2 = df2.select('date', 'product', 
                 fn.col('rating').cast(IntegerType()).alias('score'), 
                 fn.unix_timestamp('date', format='MM/dd/yyyy').alias('date_dt'))
df2 = df2.sort(fn.col('product'), fn.asc('date_dt'))

Then group by product and grab the first date and score:

df2.groupby(fn.col('product')).agg(fn.first(fn.col('date')), fn.first(fn.col('score'))).show()

>>
+-------+------------------+-------------------+
|product|first(date, false)|first(score, false)|
+-------+------------------+-------------------+
|      B|        07/02/2017|                  2|
|      C|        07/02/2015|                  1|
|      A|        07/02/2016|                  0|
+-------+------------------+-------------------+
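As for the "one shot" part of the question: the whole pipeline (map ratings, parse dates, keep the oldest row per product) can be chained. Here is the logic sketched in plain Python, with the same assumed score mapping as above, keeping the minimum-date row per product in a single pass:

```python
from datetime import datetime

# Assumed score mapping and sample rows from the question above.
rating_to_score = {'good': 2, 'medium': 1, 'bad': 0}
rows = [
    ('07/02/2018', 'A', 'good'),
    ('07/02/2017', 'B', 'good'),
    ('07/02/2016', 'A', 'bad'),
    ('07/02/2015', 'C', 'medium'),
    ('07/02/2016', 'C', 'bad'),
]

oldest = {}
for date, product, r in rows:
    parsed = datetime.strptime(date, '%m/%d/%Y')
    # Keep the row with the earliest date seen so far for each product.
    if product not in oldest or parsed < oldest[product][0]:
        oldest[product] = (parsed, date, rating_to_score[r])

for product in sorted(oldest):
    _, date, score = oldest[product]
    print(product, date, score)
```

In Spark itself, the equivalent one-shot version chains `na.replace`, `select`, and a window: `pyspark.sql.Window.partitionBy('product').orderBy('date_dt')` with `fn.row_number`, then a filter keeping row number 1.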
titipata