I'm using PySpark to develop a machine learning project. I have a lot of records with a field that stores a date taken from MongoDB. This field is a string, but it contains a date in ISO format.

How can I convert it to one of the date types supported by Apache Spark? If possible, I would like to convert the whole column that contains this date field.

Here's an example of this field in a JSON format:

"date": "2020-11-09T07:27:57.078Z"

1 Answer

Just cast the column to a timestamp using df.select(F.col('date').cast('timestamp')). If you want a date type, cast to 'date' instead.

import pyspark.sql.functions as F

df = spark.createDataFrame([['2020-11-09T07:27:57.078Z']]).toDF('date')
df.show(truncate=False)
+------------------------+
|date                    |
+------------------------+
|2020-11-09T07:27:57.078Z|
+------------------------+

df.printSchema()
root
 |-- date: string (nullable = true)

# cast to timestamp
df2 = df.select(F.col('date').cast('timestamp'))

df2.printSchema()
root
 |-- date: timestamp (nullable = true)

df2.show(truncate=False)
+-----------------------+
|date                   |
+-----------------------+
|2020-11-09 07:27:57.078|
+-----------------------+
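
If the plain cast ever comes back as null (for example, when the string is in a slightly different format), parsing with an explicit pattern is an alternative. This is a minimal sketch, assuming Spark 3.x datetime patterns (where XXX matches the trailing Z of the sample value); the name df2b is just illustrative:

# alternative: parse with an explicit pattern instead of a plain cast
df2b = df.select(F.to_timestamp('date', "yyyy-MM-dd'T'HH:mm:ss.SSSXXX").alias('date'))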

# cast to date
df3 = df.select(F.col('date').cast('date'))

df3.printSchema()
root
 |-- date: date (nullable = true)

df3.show()
+----------+
|      date|
+----------+
|2020-11-09|
+----------+
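
Since you want to convert the whole column, withColumn can replace the original string column in place while keeping any other columns in the DataFrame (df4 is just an illustrative name):

# overwrite the 'date' column with its timestamp version, keeping all other columns
df4 = df.withColumn('date', F.col('date').cast('timestamp'))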