How do you define a date format for a column in a PySpark SQL schema to parse it to a timestamp?
At the moment I can only do this in an additional DataFrame transformation step on a StringType field, which is cumbersome.
Thank you!
Here is the indirect approach I use for now:
Given input .csv data like:
1,Foo,19/06/2017 14:41:20
2,Bar,19/06/2018 15:41:45
I do:
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

field_integer = StructField('my_integer', IntegerType())
field_string = StructField('my_string', StringType())
# Want some date format here, but have to treat the date field as StringType.
field_date = StructField('my_date', StringType())
my_schema = StructType([field_integer, field_string, field_date])

df_test = spark.read.schema(my_schema).csv('my_data.csv')
df_new = df_test.select(['my_integer', 'my_string',
                         to_timestamp(df_test['my_date'], 'dd/MM/yyyy HH:mm:ss').alias('parsed_date')])
which works but is not very direct.
df_new.show()
+----------+---------+-------------------+
|my_integer|my_string|        parsed_date|
+----------+---------+-------------------+
|         1|      Foo|2017-06-19 14:41:20|
|         2|      Bar|2018-06-19 15:41:45|
+----------+---------+-------------------+
df_new.printSchema()
root
|-- my_integer: integer (nullable = true)
|-- my_string: string (nullable = true)
|-- parsed_date: timestamp (nullable = true)
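For reference, this is roughly what I was hoping would work more directly: declaring the column as TimestampType in the schema and passing the pattern to the CSV reader via its timestampFormat option. This is only a sketch; I have not verified that the option is applied to schema columns this way:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType, TimestampType

# Hypothetical direct approach: declare the column as TimestampType up front
# and tell the CSV reader which pattern the raw strings use.
direct_schema = StructType([
    StructField('my_integer', IntegerType()),
    StructField('my_string', StringType()),
    StructField('my_date', TimestampType()),
])

df_direct = (spark.read
             .schema(direct_schema)
             .option('timestampFormat', 'dd/MM/yyyy HH:mm:ss')
             .csv('my_data.csv'))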