
PySpark 2.1:

I've created a dataframe with a timestamp column that I convert to a unix timestamp. However, the column derived from the unix timestamp is incorrect. As the timestamp increases, the unix_timestamp should also increase, but this is not the case. The code below shows an example of this. Note that when you sort by the timestamp column and by the unix_ts column you get different orders.

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
        ("a", "1", "2018-01-08 23:03:23.325359"),
        ("a", "2", "2018-01-09 00:03:23.325359"),
        ("a", "3", "2018-01-09 00:03:25.025240"),
        ("a", "4", "2018-01-09 00:03:27.025240"),
        ("a", "5", "2018-01-09 00:08:27.021240"),
        ("a", "6", "2018-01-09 03:03:27.025240"),
        ("a", "7", "2018-01-09 05:03:27.025240"),


], ["person_id", "session_id", "timestamp"])

df = df.withColumn("unix_ts",F.unix_timestamp(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))

df.orderBy("timestamp").show(10,False)
df.orderBy("unix_ts").show(10,False)

Output:

+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp                 |unix_ts   |
+---------+----------+--------------------------+----------+
|a        |1         |2018-01-08 23:03:23.325359|1515474528|
|a        |2         |2018-01-09 00:03:23.325359|1515478128|
|a        |3         |2018-01-09 00:03:25.025240|1515477830|
|a        |4         |2018-01-09 00:03:27.025240|1515477832|
|a        |5         |2018-01-09 00:08:27.021240|1515478128|
|a        |6         |2018-01-09 03:03:27.025240|1515488632|
|a        |7         |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+

+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp                 |unix_ts   |
+---------+----------+--------------------------+----------+
|a        |1         |2018-01-08 23:03:23.325359|1515474528|
|a        |3         |2018-01-09 00:03:25.025240|1515477830|
|a        |4         |2018-01-09 00:03:27.025240|1515477832|
|a        |5         |2018-01-09 00:08:27.021240|1515478128|
|a        |2         |2018-01-09 00:03:23.325359|1515478128|
|a        |6         |2018-01-09 03:03:27.025240|1515488632|
|a        |7         |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+

Is this a bug or am I doing something/implementing this wrong?

Also, you can see that both 2018-01-09 00:03:23.325359 and 2018-01-09 00:08:27.021240 produce the same unix_ts value of 1515478128.


1 Answer


The problem seems to be that Spark's unix_timestamp internally uses Java's SimpleDateFormat to parse dates, and SimpleDateFormat does not support microseconds (see e.g. here). Furthermore, unix_timestamp returns a long, so it only has a granularity of seconds.
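
More precisely, SimpleDateFormat's S pattern letter means milliseconds, so a six-digit fraction like 325359 gets read as 325359 ms (roughly 325 s) and added on top of the parsed time. A rough illustration of the arithmetic behind the collision between your rows 2 and 5 (plain Python, just to show the effect):

# fractional part of "2018-01-09 00:03:23.325359", intended as microseconds
shift_2 = 325359 / 1000.0   # read as milliseconds -> ~325 s, so 00:03:23 becomes ~00:08:48
# fractional part of "2018-01-09 00:08:27.021240"
shift_5 = 21240 / 1000.0    # ~21 s, so 00:08:27 also becomes ~00:08:48
print(shift_2, shift_5)     # 325.359 21.24

That is why those two rows end up with the identical unix_ts of 1515478128, and why sorting by unix_ts no longer matches sorting by timestamp.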

One fix would be to just parse without the microsecond information, and add the microseconds back in separately:

from pyspark.sql import functions as f

df = spark.createDataFrame([
        ("a", "1", "2018-01-08 23:03:23.325359"),
        ("a", "2", "2018-01-09 00:03:23.325359"),
        ("a", "3", "2018-01-09 00:03:25.025240"),
        ("a", "4", "2018-01-09 00:03:27.025240"),
        ("a", "5", "2018-01-09 00:08:27.021240"),
        ("a", "6", "2018-01-09 03:03:27.025240"),
        ("a", "7", "2018-01-09 05:03:27.025240"),
], ["person_id", "session_id", "timestamp"])

# parse the timestamp up to the seconds place
df = df.withColumn("unix_ts_sec",f.unix_timestamp(f.substring(f.col("timestamp"), 1, 19), "yyyy-MM-dd HH:mm:ss"))
# extract the microseconds
df = df.withColumn("microsec", f.substring(f.col("timestamp"), 21, 6).cast('int'))
# add to get full epoch time accurate to the microsecond
df = df.withColumn("unix_ts", f.col("unix_ts_sec") + 1e-6 * f.col("microsec"))

Side note: I don't have easy access to Spark 2.1, but using Spark 2.2 I get nulls for unix_ts as originally written. You seem to have hit some sort of a Spark 2.1 bug giving you nonsense timestamps.
