
I have a CSV file presence.csv in my data lake:

TIME,RESULT,PARENTLOCALIZATIONCODE,PARENTNAME,SIGFOXID,STATUS
2018-12-10T00:06:14.5464070Z,OK,A,A,19A9BC,OFF_PERIOD

I tried to read it, but the TIME column output is wrong:

data = sqlContext.read.csv(
    'presence.csv',
    header='true',
    inferSchema='true',
    sep=","
)

data.show(truncate=False)
+----------------------+------+----------------------+----------+--------+----------+
|TIME                  |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
+----------------------+------+----------------------+----------+--------+----------+
|2018-12-10 01:37:18.07|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
+----------------------+------+----------------------+----------+--------+----------+

I first thought about a time difference between my server and GMT, but one timestamp is 00:06:14 and the other is 01:37:18, about a 01:31 difference, which is just weird.
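For what it's worth, the display format above suggests that TIME was inferred as a timestamp rather than read as a plain string. A quick way to confirm, using the same data frame as above:

# Check how Spark inferred each column; if TIME shows up as
# "timestamp", Spark has parsed it and may shift it on display.
data.printSchema()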

Do you have any idea why this conversion happens?

  • @pault At least it seems to be consistent in terms of minutes and seconds... the hour difference probably comes from the server's time zone (USA or Europe, I guess). – Steven Feb 26 '19 at 16:53
  • You can specify the timestamp format instead of Spark's default timestamp format: spark.read.csv("b.csv", header=True, inferSchema=True, timestampFormat="yyyy-MM-dd hh:mm:sss") – Ranga Vure Feb 26 '19 at 16:59

1 Answer


From the docs for pyspark.sql.DataFrameReader.csv, the default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX.

The main problem with your data is that the fraction of a second carries more digits than the default SSS accounts for. So for this data you need to use timestampFormat="yyyy-MM-dd'T'hh:mm:ss:SSSSSSZZ":

data = spark.read.csv(
    'presence.csv',
    header='true',
    inferSchema='true',
    sep=",",
    timestampFormat="yyyy-MM-dd'T'hh:mm:ss:SSSSSSZZ"
)

data.show(truncate=False)
#+-----------------------+------+----------------------+----------+--------+----------+
#|TIME                   |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
#+-----------------------+------+----------------------+----------+--------+----------+
#|2018-12-09 19:06:14.546|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
#+-----------------------+------+----------------------+----------+--------+----------+

But as you can see here, the TIME column is being converted to local time (which on my system is GMT-4).

If this is not what you want, the "fix" depends on your Spark version and is detailed in the answers to Spark Structured Streaming automatically converts timestamp to local time.
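For instance, on Spark 2.2+ one fix from that thread is to pin the session time zone to UTC before reading (a sketch; the config key requires Spark 2.2 or later):

# Sketch (Spark 2.2+): display parsed timestamps in UTC instead of
# the machine's local time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")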

If you applied the version-specific "fix", you'd see the following result:

data.show(truncate=False)
#+-----------------------+------+----------------------+----------+--------+----------+
#|TIME                   |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
#+-----------------------+------+----------------------+----------+--------+----------+
#|2018-12-10 00:06:14.546|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
#+-----------------------+------+----------------------+----------+--------+----------+
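If you would rather sidestep timestamp parsing entirely, an alternative (not from the original answer) is to skip schema inference so that TIME stays a plain string:

# Alternative sketch: with inferSchema left at its default (False),
# every column is read as a string and TIME keeps its original text,
# e.g. 2018-12-10T00:06:14.5464070Z.
raw = spark.read.csv('presence.csv', header=True, sep=',')
raw.select('TIME').show(truncate=False)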


– pault