1

Is it possible to get the first Datetime of each day from a certain dataframe?

Schema:

root
 |-- Datetime: timestamp (nullable = true)
 |-- Quantity: integer (nullable = true)


+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-10 10:08:16|     100|
|2021-09-11 10:05:11|     100|
|2021-09-11 10:07:25|     100|
|2021-09-11 10:07:14|    3000|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+

Desired output:

+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-11 10:05:11|     100|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+
blackbishop
  • 30,945
  • 11
  • 55
  • 76
roccaforte
  • 81
  • 5
  • 2
    Does this answer your question? [How to select the first row of each group?](https://stackoverflow.com/questions/33878370/how-to-select-the-first-row-of-each-group) – mazaneicha Feb 17 '22 at 19:15

1 Answers1

2

You can use row_number for that. Simply define a Window partitioned by day and ordered by Datetime:

from pyspark.sql import functions as F, Window

w = Window.partitionBy(F.to_date("Datetime")).orderBy("Datetime")

df1 = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

df1.show()
#+-------------------+--------+
#|           Datetime|Quantity|
#+-------------------+--------+
#|2021-09-10 10:08:11|     200|
#|2021-09-11 10:05:11|     100|
#|2021-09-12 09:24:11|    1000|
#+-------------------+--------+
blackbishop
  • 30,945
  • 11
  • 55
  • 76