
I have a date column in my DataFrame.


I want to filter the last 14 days from the DataFrame using the date column. I tried the code below, but it's not working:

  last_14 = df.filter((df('Date')> date_add(current_timestamp(), -14)).select("Event_Time","User_ID","Impressions","Clicks","URL", "Date")

Event_Time, User_ID, Impressions, Clicks, and URL are my other columns.

Can anyone advise how to do this?

sthambi

2 Answers

from pyspark.sql import functions as F

# Sample data: dates arrive as strings, then get cast to DateType
df = spark.createDataFrame(
    [
        ('2022-03-10',),
        ('2022-03-09',),
        ('2022-03-08',),
        ('2022-02-02',),
        ('2022-02-01',),
    ],
    ['Date'],
).withColumn('Date', F.to_date('Date', 'y-M-d'))

# Keep only rows from the last 14 days
df\
    .filter(F.col('Date') > F.date_sub(F.current_date(), 14))\
    .show()
+----------+
|      Date|
+----------+
|2022-03-10|
|2022-03-09|
|2022-03-08|
+----------+
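As a side note, `F.date_add` with a negative offset is equivalent to `F.date_sub`, and is closer to what the question attempted — a minimal sketch, assuming the same `df` as above:

# Equivalent filter: date_add with a negative offset instead of date_sub
df.filter(F.col('Date') > F.date_add(F.current_date(), -14)).show()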

In your code it would be:

last_14 = df.filter(F.col('Date') > F.date_sub(F.current_date(), 14)) \
    .select("Event_Time", "User_ID", "Impressions", "Clicks", "URL", "Date")
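For reference, the original snippet fails for two reasons: `df('Date')` is Scala syntax — in PySpark you reference a column with `df['Date']`, `df.Date`, or `F.col('Date')` — and the line is missing a closing parenthesis before `.select`.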
Luiz Viola

Another similar solution (it's just a matter of style preference) is to use expr to write a SQL expression:

from pyspark.sql import functions as F

last_14 = df.filter(F.col('Date') > F.expr('current_date() - interval 14 days'))

last_14.show()
+----------+
|      Date|
+----------+
|2023-07-10|
|2023-07-09|
|2023-07-08|
+----------+
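Since `filter` also accepts a SQL predicate as a plain string, the same condition can be written without `expr` at all — a minimal sketch, assuming the same `df` as above:

# Same filter expressed as a raw Spark SQL predicate string
last_14 = df.filter("Date > current_date() - interval 14 days")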
Ric S