I have a Spark DataFrame like this:

timestamp            userId
2016-07-26 12:05:00   a
2016-07-26 12:05:01   b
2016-07-26 12:05:02   c
2016-07-26 12:05:03   d
2016-07-26 12:05:04   e
2016-07-26 12:05:05   f

I want to put rows that fall within the same 5-second interval into one group, like:

timestamp            userId   group
2016-07-26 12:05:00   a        1  
2016-07-26 12:05:01   b        1
2016-07-26 12:05:02   c        1
2016-07-26 12:05:03   d        1
2016-07-26 12:05:04   e        1
2016-07-26 12:05:05   f        2

Is there a way to do this without converting the Spark DataFrame into an R data frame?

Sotos
Abhishek Gupta

1 Answer

This functionality is commonly known as sessionization and is frequently used by web analysts to identify the sessions of a particular user. There are Hive UDFs for this, which can be used from Spark through a Hive-enabled SQL context. For example, see https://docs.treasuredata.com/articles/udfs
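To make the difference between the two kinds of grouping concrete, here is a minimal sketch in plain Python (not Spark or SparkR code). It assumes the timestamps are strings in the format shown in the question. The function names `fixed_window_groups` and `sessionize` are made up for illustration and are not part of any library. The first function reproduces the fixed 5-second buckets from the expected output in the question; the second is gap-based sessionization, which starts a new group only when the gap to the previous row exceeds the threshold.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format, as in the question


def fixed_window_groups(timestamps, width_sec=5):
    """Number each row by the fixed-width window it falls in,
    with windows measured from the earliest timestamp."""
    times = [datetime.strptime(t, FMT) for t in timestamps]
    start = min(times)
    # Integer-divide the offset from the first row by the window width.
    buckets = [int((t - start).total_seconds()) // width_sec for t in times]
    # Renumber the occupied buckets densely as 1, 2, 3, ... in time order.
    ids = {b: i + 1 for i, b in enumerate(sorted(set(buckets)))}
    return [ids[b] for b in buckets]


def sessionize(timestamps, max_gap_sec=5):
    """Gap-based grouping: start a new group whenever the gap to the
    previous row (in time order) is larger than max_gap_sec."""
    times = [datetime.strptime(t, FMT) for t in timestamps]
    order = sorted(range(len(times)), key=lambda i: times[i])
    groups = [0] * len(times)
    current = 0
    prev = None
    for i in order:
        if prev is None or (times[i] - prev).total_seconds() > max_gap_sec:
            current += 1
        groups[i] = current
        prev = times[i]
    return groups


rows = [
    "2016-07-26 12:05:00",
    "2016-07-26 12:05:01",
    "2016-07-26 12:05:02",
    "2016-07-26 12:05:03",
    "2016-07-26 12:05:04",
    "2016-07-26 12:05:05",
]

print(fixed_window_groups(rows))  # [1, 1, 1, 1, 1, 2]
print(sessionize(rows))           # [1, 1, 1, 1, 1, 1]
```

On the sample data the two approaches disagree: fixed windows put row f in group 2, as in the question, while gap-based sessionization keeps all six rows together because no two consecutive rows are more than 5 seconds apart. It is worth deciding which of the two is actually wanted before picking a UDF.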

Mohit Bansal