
I am trying to parse a date column in PySpark, converting dd/mm/yyyy strings into yyyy-mm-dd.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName('test').getOrCreate()
sc = spark.sparkContext
sqlc = SQLContext(sc)

df = sqlc.createDataFrame([('01/01/2018', 'user1'), ('28/02/2017', 'user2')], ['Date', 'user'])
df.show()
+----------+-----+
|      Date| user|
+----------+-----+
|01/01/2018|user1|
|28/02/2017|user2|
+----------+-----+

What I've done so far is:

df.select(F.concat_ws('-', F.split(F.col('Date'), '/')).alias('Date_parsed')).show()
+-----------+
|Date_parsed|
+-----------+
| 01-01-2018|
| 28-02-2017|
+-----------+

What I would like to obtain is:

+-----------+
|Date_parsed|
+-----------+
| 2018-01-01|
| 2017-02-28|
+-----------+

Any idea how to do this without using a udf?
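I suspect the pieces from `split` could be reordered directly with `getItem` (rough sketch below, assuming I'm reading the Column API right), but I'm not sure this is the idiomatic way:

df.select(F.concat_ws('-',
                      F.split('Date', '/').getItem(2),   # year
                      F.split('Date', '/').getItem(1),   # month
                      F.split('Date', '/').getItem(0)    # day
                      ).alias('Date_parsed')).show()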

  • Good catch; nevertheless, there is also a question about changing the parsing of the date column display here, for which the function `to_date` given by Ali Yesilli works perfectly. – ysearka Oct 22 '18 at 16:36

1 Answer


You can use SQL functions for this:

>>> import pyspark.sql.functions as F
>>> 
>>> df.show()
+----------+-----+
|      Date| user|
+----------+-----+
|01/01/2018|user1|
|28/02/2017|user2|
+----------+-----+

>>> df.withColumn('Date',F.date_format(F.to_date('Date','dd/MM/yyyy'),'yyyy-MM-dd')).show()
+----------+-----+
|      Date| user|
+----------+-----+
|2018-01-01|user1|
|2017-02-28|user2|
+----------+-----+
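Note that `to_date` parses the string into a DateType column and `date_format` turns it back into a string. If a real date column is fine for you, `to_date` alone should be enough, since show() displays dates in yyyy-MM-dd form by default; a minimal sketch:

>>> df.withColumn('Date', F.to_date('Date', 'dd/MM/yyyy')).show()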

Update: note that in some versions of Spark (e.g. 2.1.1), `to_date` doesn't take a format argument (it was added in Spark 2.2), in which case you can use F.unix_timestamp to parse the date column beforehand:

df.withColumn('Date', F.date_format(
    F.to_date(F.unix_timestamp(F.col('Date'), 'dd/MM/yyyy').cast('timestamp')),
    'yyyy-MM-dd'
)).show()
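Or, equivalently, you can skip `to_date` entirely and format the timestamp directly (this is the variant from the comments below):

df.withColumn('Date', F.date_format(
    F.unix_timestamp('Date', 'dd/MM/yyyy').cast('timestamp'),
    'yyyy-MM-dd'
)).show()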
Ali Yesilli
  • I have tried using that; unfortunately I get the error: `TypeError: to_date() takes exactly 1 argument (2 given)`. Note that I am using Spark 2.1.1. – ysearka Oct 22 '18 at 09:33
  • I think it is because of the Spark version. Try this: df.withColumn('Date', F.date_format(F.unix_timestamp('Date', 'dd/MM/yyyy').cast("timestamp"), 'yyyy-MM-dd')).show() – Ali Yesilli Oct 22 '18 at 09:36
  • This works perfectly, thank you! – ysearka Oct 22 '18 at 09:37