Let's say I've got a dataset like this:
| item | event | timestamp | user |
|:-----------|------------:|:---------:|:---------:|
| titanic | view | 1 | 1 |
| titanic | add to bag | 2 | 1 |
| titanic | close | 3 | 1 |
| avatar | view | 6 | 1 |
| avatar | close | 10 | 1 |
| titanic | view | 20 | 1 |
| titanic | purchase | 30 | 1 |
and so on. I need to calculate a sessionId for each user for contiguous runs of events related to a particular item.
So for this particular data the output should be the following:
| item | event | timestamp | user | sessionId |
|:-----------|------------:|:---------:|:---------:|:--------------:|
| titanic | view | 1 | 1 | session1 |
| titanic | add to bag | 2 | 1 | session1 |
| titanic | close | 3 | 1 | session1 |
| avatar | view | 6 | 1 | session2 |
| avatar | close | 10 | 1 | session2 |
| titanic | view | 20 | 1 | session3 |
| titanic | purchase | 30 | 1 | session3 |
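For reference, the sample input can be built like this (Scala; the `events` name is just for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// input data from the first table above
val events = Seq(
  ("titanic", "view",       1L,  1),
  ("titanic", "add to bag", 2L,  1),
  ("titanic", "close",      3L,  1),
  ("avatar",  "view",       6L,  1),
  ("avatar",  "close",      10L, 1),
  ("titanic", "view",       20L, 1),
  ("titanic", "purchase",   30L, 1)
).toDF("item", "event", "timestamp", "user")
```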
I tried to use an approach similar to the one described in Spark: How to create a sessionId based on userId and timestamp, with the window:
Window.partitionBy("user", "item").orderBy("timestamp")
But that just doesn't work, because the same user/item combination can occur in different sessions (compare session1 and session3 above), and with that window they collapse into a single session.
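To make that concrete, here is a simplified sketch of the kind of thing I tried (the exact way I derive the id differs, but the problem is the same):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// one partition per (user, item), ordered by timestamp
val w = Window.partitionBy("user", "item").orderBy("timestamp")

// e.g. tagging each row with the first timestamp of its partition and
// treating that as the session key: both titanic visits of user 1 land
// in the same partition, so session1 and session3 get the same key
val attempt = events.withColumn("sessionKey", first("timestamp").over(w))
```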
I need help with another approach to implement this.