-1

I have a dataframe like below

+--+--+-----------+
| a| b|       date|
+--+--+-----------+
| 1| 2| 01/01/2022|
| 2| 3| 01/01/2021|
| 3| 4| 12/20/2021|
+--+--+-----------+

I have tried the code below but it keeps showing the 01/01/2022 date even though 30/12/2021 is not greater than 01/01/2022.

 df.filter(("30/12/2021" > col("date")) 

I have tried casting both to dates and it returns 0 records then.

df.filter("cast(StartDate as date) >= cast('2017-02-03' as date)")

Below is sample code

from pyspark.shell import spark
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [(1, 2,"01/01/2022"),
         (1, 3,"01/01/2021"),
         (2, 4,"12/20/2021"), 
         ]

schema = StructType([ \
    StructField("a", IntegerType(), True), \
    StructField("b", IntegerType(), True), \
    StructField("date", StringType(), True), \
    ])

df = spark.createDataFrame(data=data2, schema=schema)


df.filter(("30/12/2021" > col("date"))).show()
Vincent Doba
  • 4,343
  • 3
  • 22
  • 42
lunbox
  • 331
  • 3
  • 11

2 Answers2

1

You are comparing dates as strings, which will compare alphabetically from the left, so 01/01/2022 is less than 30/12/2021 because 0 is less than 3.

You need to convert your string to a date, e.g.:

import datetime

s1 = "30/12/2021"
s2 = "01/01/2022"
print(s1 < s2)  # False

d1 = datetime.datetime.strptime(s1, "%d/%m/%Y")
d2 = datetime.datetime.strptime(s2, "%d/%m/%Y")
print(d1 < d2)  # True

To do this in pyspark, it looks like you'll need to set your field to DateType and use to_date().

import random
  • 3,054
  • 1
  • 17
  • 22
1

Your dates are formatted in this order : month, day, year.

Compare dates as text won't work (unless you change the order in the string).

A good solution is to transform dates from string to datetime.

df["date"] = df["date"].map(lambda x: datetime.strptime(x, "%m/%d/%Y"))

Dataframe types before (result of df.dtypes) :

a        int64
b        int64
date    object

Dataframe types after (result of df.dtypes) :

a                int64
b                int64
date    datetime64[ns]
0x0fba
  • 1,520
  • 1
  • 1
  • 11