How to compute the diff of one column in a Spark DataFrame?

Question

+-------------------+
|           Dev_time|
+-------------------+
|2015-09-18 05:00:20|
|2015-09-18 05:00:21|
|2015-09-18 05:00:22|
|2015-09-18 05:00:23|
|2015-09-18 05:00:24|
|2015-09-18 05:00:25|
|2015-09-18 05:00:26|
|2015-09-18 05:00:27|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:38|
|2015-09-18 05:00:39|
+-------------------+

For a Spark DataFrame, I want to compute the difference between consecutive values of the datetime column, just like numpy.diff(array) does.
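To make the expected result concrete, here is a small local sketch (no Spark involved) of the consecutive-difference operation on a few of the timestamps above. The helper name `diff_seconds` is made up for illustration, and it assumes the strings use the `YYYY-MM-DD HH:MM:SS` layout shown in the table.

```python
from datetime import datetime


def diff_seconds(stamps):
    """Return the gaps, in whole seconds, between consecutive timestamp strings."""
    parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
    # Pair each element with its successor, like numpy.diff does for arrays.
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(parsed, parsed[1:])]


sample = [
    "2015-09-18 05:00:27",
    "2015-09-18 05:00:37",
    "2015-09-18 05:00:37",
    "2015-09-18 05:00:38",
]
gaps = diff_seconds(sample)
```

As with numpy.diff, the result has one element fewer than the input.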

Welcome to SO! Please share a [MCVE](http://stackoverflow.com/help/mcve) so we can try to help. This is a very low quality question — eliasah, Nov 26 '15 at 10:21

score 1 · Accepted Answer · edited May 23 '17 at 11:44

Generally speaking, there is no efficient way to achieve this using Spark DataFrames, and ordering becomes quite tricky in a distributed setup. In theory, you can use the lag function as follows:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# Cast the string column straight to a timestamp; this avoids depending on
# how a given Spark version scales numeric-to-timestamp casts.
dev_time = col("dev_time").cast("timestamp")

df = sc.parallelize([
    ("2015-09-18 05:00:20", ), ("2015-09-18 05:00:21", ),
    ("2015-09-18 05:00:22", ), ("2015-09-18 05:00:23", ),
    ("2015-09-18 05:00:24", ), ("2015-09-18 05:00:25", ),
    ("2015-09-18 05:00:26", ), ("2015-09-18 05:00:27", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:37", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:38", ),
    ("2015-09-18 05:00:39", )
]).toDF(["dev_time"]).withColumn("dev_time", dev_time)

w = Window.orderBy("dev_time")
lag_dev_time = lag("dev_time").over(w).cast("integer")

diff = df.select((col("dev_time").cast("integer") - lag_dev_time).alias("diff"))

## diff.show()
## +----+
## |diff|
## +----+
## |null|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |  10|
## ...

But it is extremely inefficient, because a window function without a PARTITION BY clause moves all data to a single partition. In practice it makes more sense to use the sliding method on an RDD (Scala) or to implement your own sliding window (Python).
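One possible shape for a hand-rolled sliding window in Python is sketched below. It is an illustration under assumptions, not a reference implementation: the names `last_of_partition`, `diff_partition`, and `rdd_diff` are hypothetical, the RDD is assumed to be already globally sorted (for example with `sortBy`), and its elements are assumed to support subtraction (such as epoch seconds). The idea is to collect only the last element of each partition to the driver, then let every partition compute its own differences, using the tail of the nearest preceding non-empty partition for its first row.

```python
def last_of_partition(index, rows):
    """Yield (partition index, last element) for every non-empty partition."""
    last, seen = None, False
    for row in rows:
        last, seen = row, True
    if seen:
        yield (index, last)


def diff_partition(index, rows, tails):
    """Yield the difference between each row and its predecessor.

    `tails` maps partition index -> last element of that partition, so the
    first row here can be compared with the end of the previous partition.
    The very first row of the dataset yields None, like lag() does.
    """
    prev = None
    # Walk backwards to the nearest non-empty partition before this one.
    for j in range(index - 1, -1, -1):
        if j in tails:
            prev = tails[j]
            break
    for row in rows:
        yield None if prev is None else row - prev
        prev = row


def rdd_diff(sorted_rdd):
    """Consecutive differences over an RDD that is already globally sorted."""
    # Only one value per partition travels to the driver.
    tails = dict(sorted_rdd.mapPartitionsWithIndex(last_of_partition).collect())
    return sorted_rdd.mapPartitionsWithIndex(
        lambda i, rows: diff_partition(i, rows, tails))
```

A call might look like `rdd_diff(rdd.sortBy(lambda x: x))`, where `rdd` holds the timestamps as numbers. The data stays distributed; only the small `tails` dictionary is shipped inside the closure.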


answered Nov 27 '15 at 00:38

zero323


How about using the approxCountDistinct function? – giaosudau Jan 05 '16 at 07:09
@giaosudau How does it help? – zero323 Jan 05 '16 at 07:11
Yeah, because it implements the HyperLogLog algorithm? I think it is more efficient for counting distinct values. I am not sure about that, just asking. – giaosudau Jan 05 '16 at 07:14
@giaosudau What I mean is: how does counting distinct elements help with computing differences over a time series? – zero323 Jan 05 '16 at 07:17
Sorry, I simply assumed he wants to count the distinct datetime values, so I thought of just using the DataFrame function. – giaosudau Jan 05 '16 at 07:24
@giaosudau If that's the case then sure. But I don't think this is the goal here. – zero323 Jan 05 '16 at 07:28
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/99770/discussion-between-giaosudau-and-zero323). – giaosudau Jan 05 '16 at 07:29
