I'm trying to figure out if what I'm trying to accomplish is even possible in Spark. Let's say I have a CSV that if read in as a DataFrame that looks like so:
+---------------------+-----------+-------+-------------+
| TimeStamp | Customer | User | Application |
+---------------------+-----------+-------+-------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 |
| 2017-01-01 12:00:05 | customer1 | user1 | app1 |
| 2017-01-01 14:00:03 | customer1 | user2 | app2 |
| 2017-01-01 23:50:50 | customer1 | user1 | app1 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 |
+---------------------+-----------+-------+-------------+
I'm trying to produce a dataframe that includes a count of the number of the times a unique user from a certain customer has visited an application in the last 24 hours. So the result would look like so:
+---------------------+-----------+-------+-------------+----------------------+
| TimeStamp | Customer | User | Application | UniqueUserVisitedApp |
+---------------------+-----------+-------+-------------+----------------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 | 0 |
| 2017-01-01 12:00:05 | customer1 | user2 | app1 | 1 |
| 2017-01-01 13:00:05 | customer1 | user2 | app1 | 2 |
| 2017-01-01 14:00:03 | customer1 | user1 | app1 | 2 |
| 2017-01-01 23:50:50 | customer1 | user3 | app1 | 2 |
| 2017-01-01 23:50:51 | customer2 | user4 | app2 | 0 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 | 3 |
+---------------------+-----------+-------+-------------+----------------------+
So I can do a tumbling window with the following code below, but that's not quite what we are looking for.
val data = spark.read.csv('path/to/csv')
val tumblingWindow = data
.groupBy(col("Customer"), col("Application"), window(data.col("TimeStamp"), "24 hours"))
.agg(countDistinct("user")).as("UniqueUsersVisitedApp")
The result is this:
+-----------+-------------+-------------------------+-----------------------+
| Customer | Application | Window | UniqueUsersVisitedApp |
+-----------+-------------+-------------------------+-----------------------+
| customer1 | app1 | [2017-01-01 00:00:00... | 2 |
| customer2 | app2 | [2017-01-01 00:00:00... | 1 |
| customer1 | app1 | [2017-01-02 00:00:00... | 1 |
+-----------+-------------+-------------------------+-----------------------+
Any help would be much appreciated.