I am new to pyspark, so apologies if this is too basic a question. I have looked at a lot of other StackOverflow posts (like this, this, and this), but none of the answers there quite fit what I'm trying to do.

I have a dataframe and a string parameter/variable representing a dataset date. I would like to use that dataset date to filter the dataframe. Here is what I have tried:

import pyspark.sql.functions as F
from pyspark.sql.functions import col

customer_panel_s3_location = f"s3://my-bucket/region_id={region_id}/marketplace_id={marketplace_id}/"
customer_panel_table = spark.read.parquet(customer_panel_s3_location)
customer_panel_table.createOrReplaceTempView("customer_panel")
dataset_date = '2023-03-16'
df_customer_panel_table = (
    spark.read.parquet(customer_panel_s3_location)
    .withColumn("dataset_date", dataset_date)
    .filter(col("target_date") < F.to_date(col("dataset_date"), "MM-dd-yyyy"))
)

But it returns the following error:

AssertionError                            Traceback (most recent call last)
<ipython-input-96-6b05b499d457> in <module>
     14 
---> 15 df_customer_panel_table = spark.read.parquet(customer_panel_s3_location).withColumn("dataset_date", dataset_date).filter(col("target_date") < F.to_date(col("dataset_date"),"MM-dd-yyyy"))
     16 print(f"{log_prefix} => Physical plan for customer_panel_table\ndf_customer_panel_table.explain()")

/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py in withColumn(self, colName, col)
   1997 
   1998         """
-> 1999         assert isinstance(col, Column), "col should be Column"
   2000         return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
   2001 

I also looked at a well-known pyspark tutorial site to see if there is any example of converting a string scalar value to a date type and using it in a pyspark filter. I'm hoping someone can share how to do this. Thank you in advance for your help!

    It should be `.withColumn("dataset_date", F.lit(dataset_date))`: lit is used to build a column of literal values. Also, if you can provide sample data with the expected output, it would be easier to help. – Ronak Jain Mar 17 '23 at 18:07
    Another way of saying what @RonakJain said: the second param is supposed to be a "column" object, which is more akin to `pandas.Series`. If you want a literal value, you can use the `pyspark.sql.functions.lit()` function to create a column object out of the literal. – Kashyap Mar 17 '23 at 21:23
    @RonakJain and @Kashyap - Thank you both! `F.lit(dataset_date)` works like a charm. :) Having said that, I'll accept @DivineCoder's answer below because the same fix was submitted there as an answer. – user1330974 Mar 17 '23 at 23:11
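To see concretely what the comments describe, here is a minimal sketch (not code from the original thread): withColumn requires its second argument to be a Column, and F.lit() is what turns a plain Python value into one.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# a bare string fails withColumn's `assert isinstance(col, Column)` check;
# wrapping it with lit() produces a proper Column object
print(type(F.lit('2023-03-16')))  # <class 'pyspark.sql.column.Column'>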

1 Answer

In order to create a date column, use the pyspark.sql.functions.lit function to wrap your string date literal before passing it to withColumn.

So the lines that add a fixed date column to your dataframe and filter on it will look like the following (I changed the variable names so the code fits the screen nicely):

...
# see how the following lines use lit
import pyspark.sql.functions as F
from pyspark.sql.functions import col

dataset_date = '2023-03-16'
df = df.withColumn("dataset_date", F.lit(dataset_date))
# the format pattern must match the literal: '2023-03-16' is yyyy-MM-dd, not MM-dd-yyyy
df = df.filter(col("target_date") < F.to_date(col("dataset_date"), "yyyy-MM-dd"))
...
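Equivalently (a minimal sketch, assuming dataset_date stays in yyyy-MM-dd format): since dataset_date is already a plain Python string, you can skip the helper column and compare against the literal directly:

df = df.filter(col("target_date") < F.to_date(F.lit(dataset_date), "yyyy-MM-dd"))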

DivineCoder