
Let me break this problem down into a smaller chunk. I have a DataFrame in PySpark with a column arrival_date of date type:

from pyspark.sql.functions import col, lit, to_date, when
values = [('22.05.2016',),('13.07.2010',),('15.09.2012',),(None,)]
df = sqlContext.createDataFrame(values,['arrival_date'])

# Convert the string column to date type using the pattern dd.MM.yyyy
df = df.withColumn('arrival_date',to_date(col('arrival_date'),'dd.MM.yyyy'))
df.show()
+------------+
|arrival_date|
+------------+
|  2016-05-22|
|  2010-07-13|
|  2012-09-15|
|        null|
+------------+

df.printSchema()
root
 |-- arrival_date: date (nullable = true)

After applying a lot of transformations to the DataFrame, I finally want to fill in the missing dates (the null values) with 01-01-1900.

One method is to convert the arrival_date column to string, replace the missing values with df.fillna('1900-01-01', subset=['arrival_date']), and finally convert the column back with to_date. This is very inelegant.

As expected, the following line doesn't work and raises an error:

df = df.fillna(to_date(lit('1900-01-01'),'yyyy-MM-dd'), subset=['arrival_date'])

The documentation says: "The value must be of the following type: Int, Long, Float, Double, String, Boolean."

Another way is to use withColumn() and when():

df = df.withColumn(
    'arrival_date',
    when(col('arrival_date').isNull(), to_date(lit('01.01.1900'), 'dd.MM.yyyy'))
    .otherwise(col('arrival_date'))
)

Is there a function that lets me directly assign a date of my choice to the null values in a date column?

Does anyone have a better suggestion?

cph_sto
  • Did you ever figure this out? – Rens Dec 09 '21 at 12:19
  • The last time I worked with PySpark, I could not find a simple solution to this. Maybe a newer release from the last two years has added something; I don't know. – cph_sto Dec 10 '21 at 11:14

1 Answer

Your second approach is the way to do it, but you don't need to_date to convert a string to a date; just pass datetime.date(1900, 1, 1) directly.

import datetime as dt
from pyspark.sql.functions import col, when
df = df.withColumn('arrival_date', when(col('arrival_date').isNull(), dt.date(1900, 1, 1)).otherwise(col('arrival_date')))
zzzk
  • Hello. If you read the question carefully, I have already mentioned this solution under 'Another way'. But that's something I wanted to avoid. I wanted to use .fillna() directly to solve this, avoiding if/else. – cph_sto Jan 23 '20 at 06:22
  • This is the solution I need. Perfect! – JamesAng Apr 29 '22 at 02:53