
I have a CSV file presence.csv in my data lake:

TIME,RESULT,PARENTLOCALIZATIONCODE,PARENTNAME,SIGFOXID,STATUS
2018-12-10T00:06:14.5464070Z,OK,A,A,19A9BC,OFF_PERIOD

I tried to read it, but the TIME column output is wrong:

data = sqlContext.read.csv(
    'presence.csv',
    header='true',
    inferSchema='true',
    sep=","
)

data.show(truncate=False)
+----------------------+------+----------------------+----------+--------+----------+
|TIME                  |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
+----------------------+------+----------------------+----------+--------+----------+
|2018-12-10 01:37:18.07|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
+----------------------+------+----------------------+----------+--------+----------+

I first thought about a time difference between my server and GMT, but one timestamp is 00:06:14 and the other is 01:37:18, about a 01:31 difference, which is just weird.
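For what it's worth, the display format above suggests that TIME was inferred as a timestamp rather than read as a plain string. A quick way to confirm, using the same data frame as above:

# Check how Spark inferred each column; if TIME shows up as
# "timestamp", Spark has parsed it and may shift it on display.
data.printSchema()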

Do you have any idea why this conversion happens?

  • @pault At least it seems to be consistent in terms of minutes and seconds... the hour difference probably comes from the server's time zone (USA or Europe, I guess). – Steven Feb 26 '19 at 16:53
  • You can specify the timestamp format instead of Spark's default timestamp format: spark.read.csv("b.csv", header=True, inferSchema=True, timestampFormat="yyyy-MM-dd hh:mm:sss") – Ranga Vure Feb 26 '19 at 16:59

1 Answer


From the docs for pyspark.sql.DataFrameReader.csv, the default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX.

The main problem with your data is that the fraction of a second carries more digits than the default SSS accounts for. So for this data you need to use timestampFormat="yyyy-MM-dd'T'hh:mm:ss:SSSSSSZZ":

data = spark.read.csv(
    'presence.csv',
    header='true',
    inferSchema='true',
    sep=",",
    timestampFormat="yyyy-MM-dd'T'hh:mm:ss:SSSSSSZZ"
)

data.show(truncate=False)
#+-----------------------+------+----------------------+----------+--------+----------+
#|TIME                   |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
#+-----------------------+------+----------------------+----------+--------+----------+
#|2018-12-09 19:06:14.546|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
#+-----------------------+------+----------------------+----------+--------+----------+

But as you can see here, the TIME column is being converted to local time (which on my system is GMT-4).

If this is not what you want, the "fix" depends on your Spark version and is detailed in the answers to Spark Structured Streaming automatically converts timestamp to local time.
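For instance, on Spark 2.2+ one fix from that thread is to pin the session time zone to UTC before reading (a sketch; the config key requires Spark 2.2 or later):

# Sketch (Spark 2.2+): display parsed timestamps in UTC instead of
# the machine's local time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")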

If you applied the version-specific "fix", you'd see the following result:

data.show(truncate=False)
#+-----------------------+------+----------------------+----------+--------+----------+
#|TIME                   |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
#+-----------------------+------+----------------------+----------+--------+----------+
#|2018-12-10 00:06:14.546|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
#+-----------------------+------+----------------------+----------+--------+----------+
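If you would rather sidestep timestamp parsing entirely, an alternative (not from the original answer) is to skip schema inference so that TIME stays a plain string:

# Alternative sketch: with inferSchema left at its default (False),
# every column is read as a string and TIME keeps its original text,
# e.g. 2018-12-10T00:06:14.5464070Z.
raw = spark.read.csv('presence.csv', header=True, sep=',')
raw.select('TIME').show(truncate=False)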


– pault