Pyspark 2.1:
I've created a datafame and have a timestamp column that I convert to a unix timestamp. However, the column derived from the unix timestamp is incorrect. As the timestamp increases the unix_timestamp should also increase, however this is not the case. You can see this from the below code an example of this. Note that when you sort the timestamp variable and the unix_ts variable you get different orders.
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
("a", "1", "2018-01-08 23:03:23.325359"),
("a", "2", "2018-01-09 00:03:23.325359"),
("a", "3", "2018-01-09 00:03:25.025240"),
("a", "4", "2018-01-09 00:03:27.025240"),
("a", "5", "2018-01-09 00:08:27.021240"),
("a", "6", "2018-01-09 03:03:27.025240"),
("a", "7", "2018-01-09 05:03:27.025240"),
], ["person_id", "session_id", "timestamp"])
df = df.withColumn("unix_ts",F.unix_timestamp(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))
df.orderBy("timestamp").show(10,False)
df.orderBy("unix_ts").show(10,False)
Output:
+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp |unix_ts |
+---------+----------+--------------------------+----------+
|a |1 |2018-01-08 23:03:23.325359|1515474528|
|a |2 |2018-01-09 00:03:23.325359|1515478128|
|a |3 |2018-01-09 00:03:25.025240|1515477830|
|a |4 |2018-01-09 00:03:27.025240|1515477832|
|a |5 |2018-01-09 00:08:27.021240|1515478128|
|a |6 |2018-01-09 03:03:27.025240|1515488632|
|a |7 |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+
+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp |unix_ts |
+---------+----------+--------------------------+----------+
|a |1 |2018-01-08 23:03:23.325359|1515474528|
|a |3 |2018-01-09 00:03:25.025240|1515477830|
|a |4 |2018-01-09 00:03:27.025240|1515477832|
|a |5 |2018-01-09 00:08:27.021240|1515478128|
|a |2 |2018-01-09 00:03:23.325359|1515478128|
|a |6 |2018-01-09 03:03:27.025240|1515488632|
|a |7 |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+
Is this a bug or am I doing something/implementing this wrong?
Also, you can see that both 2018-01-09 00:03:27.025240 and
2018-01-09 00:08:27.021240produce the same unix_timestamp of
1515495832`