I am new to Spark and I'm having difficulty wrapping my head around this way of thinking. The following problems seem generic, but I have no idea how to solve them using Spark and the memory of its nodes only.
I have two lists (i.e., RDDs):
- List1 - (id, start_time, value) where the tuple (id, start_time) is unique
- List2 - (id, timestamp)
First problem: go over List2 and, for each (id, timestamp), find in List1 the value that has the same id and the maximal start_time that is before the timestamp.
For example:
List1:
(1, 10:00, a)
(1, 10:05, b)
(1, 10:30, c)
(2, 10:02, d)
List2:
(1, 10:02)
(1, 10:29)
(2, 10:03)
(2, 10:04)
Result:
(1, 10:02) => a
(1, 10:29) => b
(2, 10:03) => d
(2, 10:04) => d
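Here is the direction I am considering for the first problem, though I have no idea if it is idiomatic Spark. Since List1 only has thousands of entries (see the notes below), my idea is to collect it on the driver, sort each id's entries by start_time, broadcast that map, and binary-search it for every List2 record. A minimal sketch for spark-shell (where `sc` already exists); all the names are mine, and times are encoded as minutes since midnight just for the example:

```scala
import org.apache.spark.rdd.RDD

// Encode "HH:MM" as minutes since midnight, just for this example.
def t(s: String): Long = { val Array(h, m) = s.split(":").map(_.toLong); h * 60 + m }

val list1 = sc.parallelize(Seq(
  (1, t("10:00"), "a"), (1, t("10:05"), "b"), (1, t("10:30"), "c"), (2, t("10:02"), "d")))
val list2 = sc.parallelize(Seq((1, t("10:02")), (1, t("10:29")), (2, t("10:03")), (2, t("10:04"))))

// List1 is small, so collect it, sort each id's entries by start_time,
// and ship the resulting map to every node as a broadcast variable.
val byId: Map[Int, Array[(Long, String)]] = list1.collect()
  .groupBy(_._1)
  .map { case (id, rows) => id -> rows.map(r => (r._2, r._3)).sortBy(_._1) }
val bcast = sc.broadcast(byId)

// For each (id, timestamp), binary-search the sorted entries for the
// greatest start_time strictly before the timestamp.
val joined: RDD[((Int, Long), String)] = list2.flatMap { case (id, ts) =>
  bcast.value.get(id).flatMap { arr =>
    var lo = 0; var hi = arr.length
    while (lo < hi) { val mid = (lo + hi) / 2; if (arr(mid)._1 < ts) lo = mid + 1 else hi = mid }
    if (lo == 0) None else Some(((id, ts), arr(lo - 1)._2))
  }
}

joined.collect().foreach(println) // ((1,602),a), ((1,629),b), ((2,603),d), ((2,604),d) in some order
```

The lookup is O(log n) per List2 record, so the tens of millions of lookups should stay cheap; what I am unsure about is whether broadcasting is the right tool here, or whether there is a proper RDD-to-RDD way to express this kind of join.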
Second problem: very similar to the first, but now the start_time and timestamp values are fuzzy. That is, a time t may actually be anywhere between (t - delta) and (t + delta). Again, I need to time-join the lists.
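For the second problem I can only guess at the exact semantics, so the following is just one interpretation, and both it and the delta value are my own assumptions: with uncertainty delta, an entry with start_time s could still precede a timestamp ts whenever s - delta < ts + delta, and any candidate whose start_time lies within 2*delta of the latest candidate could be the true "maximal" one, so I emit all of them and would resolve the ambiguity downstream. Continuing from the sketch above (same `list2` and `bcast`):

```scala
val delta: Long = 1 // hypothetical uncertainty, in the same minute units

val fuzzyJoined = list2.flatMap { case (id, ts) =>
  // An entry s *could* be before ts if s - delta < ts + delta.
  val candidates = bcast.value.getOrElse(id, Array.empty[(Long, String)])
    .filter { case (s, _) => s - delta < ts + delta }
  if (candidates.isEmpty) Iterator.empty
  else {
    // Any candidate within 2*delta of the latest candidate could be the
    // true maximum, so keep all of them.
    val latest = candidates.map(_._1).max
    candidates.iterator.collect { case (s, v) if s + 2 * delta > latest => ((id, ts), v) }
  }
}

fuzzyJoined.collect().foreach(println)
```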
Notes:
- There is a solution to the first problem using Cassandra, but I'm interested in solving it using Spark and the memory of the nodes only.
- List1 has thousands of entries.
- List2 has tens of millions of entries.