I am new to pyspark, so I apologize if this is too basic a question. I have looked at a lot of other StackOverflow posts (like this, this, and this), but none of the answers there quite fit what I'm trying to do.
I have a dataframe and a string parameter/variable representing the dataset date. Using that date, I would like to filter the dataframe. Here is what I have tried:
customer_panel_s3_location = f"s3://my-bucket/region_id={region_id}/marketplace_id={marketplace_id}/"
customer_panel_table = spark.read.parquet(customer_panel_s3_location)
customer_panel_table.createOrReplaceTempView("customer_panel")
dataset_date = '2023-03-16'
df_customer_panel_table = (
spark.read.parquet(customer_panel_s3_location)
.withColumn("dataset_date", dataset_date)
.filter(col("target_date") < F.to_date(col("dataset_date"), "MM-dd-yyyy"))
)
But it returns the following error:
AssertionError Traceback (most recent call last)
<ipython-input-96-6b05b499d457> in <module>
14
---> 15 df_customer_panel_table = spark.read.parquet(customer_panel_s3_location).withColumn("dataset_date", dataset_date).filter(col("target_date") < F.to_date(col("dataset_date"),"MM-dd-yyyy"))
16 print(f"{log_prefix} => Physical plan for customer_panel_table\ndf_customer_panel_table.explain()")
/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py in withColumn(self, colName, col)
1997
1998 """
-> 1999 assert isinstance(col, Column), "col should be Column"
2000 return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
   2001

AssertionError: col should be Column
I also looked at a well-known pyspark tutorial site to see if there is any example of converting a string scalar value to a date type and using it in a pyspark filter, but I didn't find one. I'm hoping someone can share how to do it. Thank you in advance for your help!
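From the assert message in the traceback, my understanding is that withColumn needs a Column object rather than a plain Python string, so I'm guessing the string has to be wrapped in F.lit(...) first, and that the format should be "yyyy-MM-dd" since dataset_date looks like '2023-03-16'. Below is a rough, untested sketch of what I have in mind (column names are just from my example above). Is something like this the right way to do it?

# Sketch 1: wrap the string in F.lit() so withColumn receives a Column, then parse it as a date
df_customer_panel_table = (
    spark.read.parquet(customer_panel_s3_location)
    .withColumn("dataset_date", F.to_date(F.lit(dataset_date), "yyyy-MM-dd"))
    .filter(col("target_date") < col("dataset_date"))
)

# Sketch 2: skip the extra column and compare against the literal date directly
df_customer_panel_table = (
    spark.read.parquet(customer_panel_s3_location)
    .filter(col("target_date") < F.to_date(F.lit(dataset_date), "yyyy-MM-dd"))
)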