
So I read a CSV file with this schema:

from pyspark.sql.types import StructType, StructField, StringType

mySchema = StructType([StructField("StartTime", StringType(), True),
                       StructField("EndTime", StringType(), True)])

data = spark.read.load('/mnt/Experiments/Bilal/myData.csv', format='csv', header='false', schema=mySchema)
data.show(truncate = False)

I get this:

+---------------------------+---------------------------+
|StartTime                  |EndTime                    |
+---------------------------+---------------------------+
|2018-12-24T03:03:31.8088926|2018-12-24T03:07:35.2802489|
|2018-12-24T03:13:25.7756662|2018-12-24T03:18:10.1018656|
|2018-12-24T03:23:32.9391784|2018-12-24T03:27:57.2195314|
|2018-12-24T03:33:31.0793551|2018-12-24T03:37:04.6395942|
|2018-12-24T03:43:54.1638926|2018-12-24T03:46:38.1188857|
+---------------------------+---------------------------+

Now, when I convert these columns from StringType to TimestampType using:

from pyspark.sql.functions import to_timestamp

data = data.withColumn('StartTime', to_timestamp('StartTime', "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"))
data = data.withColumn('EndTime', to_timestamp('EndTime', "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"))

I get null values:

+---------+-------+
|StartTime|EndTime|
+---------+-------+
|null     |null   |
|null     |null   |
|null     |null   |
|null     |null   |
|null     |null   |
+---------+-------+
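
A note on the likely cause: the fractional part in these strings has seven digits, while the ".SSSSSS" pattern only covers six, and the pattern-based to_timestamp in Spark 2.x apparently cannot parse this fraction (the SPARK-17914 ticket linked in the comments below tracks related sub-second issues). A minimal workaround sketch, assuming data still holds the raw string columns from the read above (before the to_timestamp calls); the regexp and the data_fixed name are just for illustration:

from pyspark.sql import functions as F

# Workaround sketch: trim the fraction to six digits (Spark timestamps carry
# microsecond precision at most), then cast the cleaned string to timestamp.
data_fixed = data \
    .withColumn('StartTime', F.regexp_replace('StartTime', r'(\.\d{6})\d+', '$1').cast('timestamp')) \
    .withColumn('EndTime', F.regexp_replace('EndTime', r'(\.\d{6})\d+', '$1').cast('timestamp'))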
  • Please go through this link and make the necessary changes: https://stackoverflow.com/questions/45469438/pyspark-creating-timestamp-column – Prathik Kini Mar 28 '19 at 13:26
  • @Prathik, thanks for replying. I am already able to parse a string to a timestamp using to_timestamp(), but I want accuracy up to nanoseconds, similar to the strings I have shown. – Bilal Shafqat Mar 28 '19 at 13:36
  • There was a JIRA ticket for this issue. It states that the ticket is resolved, but I'm not sure why it's still not working: https://issues.apache.org/jira/browse/SPARK-17914 – Gladiator Mar 28 '19 at 14:33
  • @Gladiator, I really want a timestamp column including the milliseconds and nanoseconds part. Any clue or idea? – Bilal Shafqat Mar 28 '19 at 14:57
  • What is your use case? I think if you are just sorting, you can keep it as a string and it will work the same. – Gladiator Mar 28 '19 at 14:59
  • @Gladiator Nope, not just sorting but a bunch of other operations as well, like splitting datetime ranges into 10-minute and day intervals, calculating values like duration, and much more. For that, accuracy up to nanoseconds is a must. – Bilal Shafqat Mar 29 '19 at 05:14

1 Answer


I was able to solve it by casting. Strangely, it did not need a format. (Spark 2.4.0, local mode on Windows 10.)
The schema before casting:

df.printSchema()
root
 |-- StartTime: string (nullable = true)
 |-- EndTime: string (nullable = true)

from pyspark.sql import functions as F

df2 = df.withColumn('StartTime', F.col('StartTime').cast("timestamp")) \
        .withColumn('EndTime', F.col('EndTime').cast("timestamp"))

Result:

df2.show(truncate=False)
+--------------------------+--------------------------+
|StartTime                 |EndTime                   |
+--------------------------+--------------------------+
|2018-12-24 03:03:31.808892|2018-12-24 03:07:35.280248|
|2018-12-24 03:13:25.775666|2018-12-24 03:18:10.101865|
|2018-12-24 03:23:32.939178|2018-12-24 03:27:57.219531|
|2018-12-24 03:33:31.079355|2018-12-24 03:37:04.639594|
|2018-12-24 03:43:54.163892|2018-12-24 03:46:38.118885|
+--------------------------+--------------------------+

Check the schema:

df2.printSchema()
root
 |-- StartTime: timestamp (nullable = true)
 |-- EndTime: timestamp (nullable = true)
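
Note that the fractional part in the result has six digits: Spark timestamps are stored with microsecond precision, so the seventh digit from the source strings is dropped. With proper timestamp columns, the operations mentioned in the comments become simple column expressions. A small sketch, assuming the df2 above (the df3 and DurationSeconds names are just for illustration):

from pyspark.sql import functions as F

# Duration in seconds (including the fractional part): casting a timestamp
# to double yields seconds since the epoch.
df3 = df2.withColumn(
    'DurationSeconds',
    F.col('EndTime').cast('double') - F.col('StartTime').cast('double'))

# Bucket each start time into a 10-minute window.
df3.groupBy(F.window('StartTime', '10 minutes')).count().show(truncate=False)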