I am not sure if the order is guaranteed to be maintained for groupBy(). However, here is an alternative way to do what you want that will work. Use pyspark.sql.Window
to partition and order the DataFrame as desired, then use pyspark.sql.DataFrame.distinct()
to drop the duplicate entries.
For example:
Create Dummy Data
data = [
    (123, '2017-07-01', 50),
    (123, '2017-01-01', 100),
    (345, '2018-01-01', 20),
    (123, '2017-03-01', 25),
    (345, '2018-02-01', 33)
]
df = sqlCtx.createDataFrame(data, ['ProductId', 'Date', 'Price'])
df.show()
#+---------+----------+-----+
#|ProductId| Date|Price|
#+---------+----------+-----+
#| 123|2017-07-01| 50|
#| 123|2017-01-01| 100|
#| 345|2018-01-01| 20|
#| 123|2017-03-01| 25|
#| 345|2018-02-01| 33|
#+---------+----------+-----+
Use Window
Use Window.partitionBy('ProductId').orderBy('Date'):
import pyspark.sql.functions as f
from pyspark.sql import Window
# f.first('Price') over the window returns the earliest Price (by Date) for
# each ProductId on every row; distinct() then collapses the repeated rows.
df.select(
    'ProductId',
    f.first('Price').over(Window.partitionBy('ProductId').orderBy('Date')).alias('Price')
).distinct().show()
#+---------+-----+
#|ProductId|Price|
#+---------+-----+
#| 123| 100|
#| 345| 20|
#+---------+-----+
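If you would rather not rely on distinct() to collapse the duplicated rows, the same result can be obtained by ranking the rows and keeping only the first one per product. This is just a sketch using the same df and window as above:

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('ProductId').orderBy('Date')

# keep only the row with the earliest Date in each ProductId partition
df.withColumn('rn', f.row_number().over(w))\
    .where(f.col('rn') == 1)\
    .select('ProductId', 'Price')\
    .show()

This should give the same two rows as above (123 -> 100 and 345 -> 20).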
Edit
I found this Scala post in which the accepted answer says that the order is preserved, though there is a discussion in the comments that contradicts it.
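For reference, the groupBy() formulation that prompted the question looks roughly like the sketch below; whether it is correct hinges on whether the sort order survives the shuffle introduced by groupBy(), which is exactly what that discussion disputes:

import pyspark.sql.functions as f

# relies on the rows still being sorted by Date when first() is evaluated
df.orderBy('Date')\
    .groupBy('ProductId')\
    .agg(f.first('Price').alias('Price'))\
    .show()

The Window approach above sidesteps the question entirely, because the ordering is specified explicitly in the window definition.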