How do you define a date format for a column in a PySpark SQL schema to parse it to a timestamp?
At the moment I can only do this in an additional DataFrame transformation step on a StringType field, which is cumbersome.
Thank you!
Here is the indirect approach I use for now:
Given input .csv data like:
1,Foo,19/06/2017 14:41:20
2,Bar,19/06/2018 15:41:45
I do:
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

field_integer = StructField('my_integer', IntegerType())
field_string = StructField('my_string', StringType())
# Want some date format here, but have to treat the date field as StringType.
field_date = StructField('my_date', StringType())
my_schema = StructType([field_integer, field_string, field_date])

df_test = spark.read.schema(my_schema).csv('my_data.csv')
df_new = df_test.select(['my_integer', 'my_string',
                         to_timestamp(df_test['my_date'], 'dd/MM/yyyy HH:mm:ss').alias('parsed_date')])
which works but is not very direct.
df_new.show()
+----------+---------+-------------------+
|my_integer|my_string|        parsed_date|
+----------+---------+-------------------+
|         1|      Foo|2017-06-19 14:41:20|
|         2|      Bar|2018-06-19 15:41:45|
+----------+---------+-------------------+
df_new.printSchema()
root
|-- my_integer: integer (nullable = true)
|-- my_string: string (nullable = true)
|-- parsed_date: timestamp (nullable = true)
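For reference, this is roughly what I was hoping would work more directly: declaring the column as TimestampType in the schema and passing the pattern to the CSV reader via its timestampFormat option. This is only a sketch; I have not verified that the option is applied to schema columns this way:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType, TimestampType

# Hypothetical direct approach: declare the column as TimestampType up front
# and tell the CSV reader which pattern the raw strings use.
direct_schema = StructType([
    StructField('my_integer', IntegerType()),
    StructField('my_string', StringType()),
    StructField('my_date', TimestampType()),
])

df_direct = (spark.read
             .schema(direct_schema)
             .option('timestampFormat', 'dd/MM/yyyy HH:mm:ss')
             .csv('my_data.csv'))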