
I am new to Spark and I'm having difficulty wrapping my mind around this way of thinking. The following problems seem generic, but I have no idea how I can solve them using Spark and the memory of its nodes only.

I have two lists (i.e.: RDDs):

  1. List1 - (id, start_time, value) where the tuple (id, start_time) is unique
  2. List2 - (id, timestamp)

First problem: go over List2 and, for each (id, timestamp), find in List1 the value with the same id and the maximal start_time that is still before the timestamp.

For example:

List1:
 (1, 10:00, a)
 (1, 10:05, b)
 (1, 10:30, c)
 (2, 10:02, d)

List2:
 (1, 10:02)
 (1, 10:29)
 (2, 10:03)
 (2, 10:04)

Result:
 (1, 10:02) => a
 (1, 10:29) => b
 (2, 10:03) => d
 (2, 10:04) => d

Second problem: very similar to the first problem, but now the start_time and timestamp are fuzzy. This means that a time t may actually lie anywhere between (t - delta) and (t + delta). Again, I need to time-join the lists.

Notes:

  1. There is a solution to the first problem using Cassandra, but I'm interested in solving it using Spark and the memory of the nodes only.
  2. List1 has thousands of entries.
  3. List2 has tens of millions of entries.

– Dror B.

1 Answer


For brevity I have converted your time data (e.g. 10:02) to decimal data (e.g. 10.02); in practice, just use a function that converts the time string to a number.
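
A minimal sketch of such a conversion (toMinutes is a hypothetical helper, not part of the original code): it maps an "HH:mm" string to minutes since midnight, which keeps the ordering intact and makes the values easy to compare.

def toMinutes(t: String): Int = {
  // e.g. "10:02" -> 602 minutes since midnight
  val Array(h, m) = t.split(":").map(_.trim.toInt)
  h * 60 + m
}

toMinutes("10:02")  // 602
toMinutes("10:30")  // 630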

The first problem can easily be solved using Spark SQL, as shown below.

import spark.implicits._  // required for .toDF on an RDD of tuples

// List1: (id, start_time, value) -- times written as decimals for brevity
val list1 = spark.sparkContext.parallelize(Seq(
  (1, 10.00, "a"),
  (1, 10.05, "b"),
  (1, 10.30, "c"),
  (2, 10.02, "d"))).toDF("col1", "col2", "col3")

// List2: (id, timestamp)
val list2 = spark.sparkContext.parallelize(Seq(
  (1, 10.02),
  (1, 10.29),
  (2, 10.03),
  (2, 10.04)
)).toDF("col1", "col2")

list1.createOrReplaceTempView("table1")

list2.createOrReplaceTempView("table2")


scala> spark.sql("""
     | SELECT col1,col2,col3
     | FROM
     | (SELECT
     | t2.col1, t2.col2, t1.col3,
     | ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
     | FROM table2 t2
     | LEFT JOIN table1 t1
     | ON t1.col1 = t2.col1
     | AND t2.col2 > t1.col2) tmp
     | WHERE tmp.rank = 1""").show()
+----+-----+----+
|col1| col2|col3|
+----+-----+----+
|   1|10.02|   a|
|   1|10.29|   b|
|   2|10.03|   d|
|   2|10.04|   d|
+----+-----+----+
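
The same logic can also be expressed with the DataFrame API instead of SQL; a sketch equivalent to the query above, using the list1/list2 DataFrames defined earlier:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// for each (id, timestamp) in list2, rank the matching list1 rows by start_time, newest first
val w = Window.partitionBy(list2("col1"), list2("col2")).orderBy(list1("col2").desc)

val result = list2
  .join(list1, list2("col1") === list1("col1") && list2("col2") > list1("col2"), "left")
  .withColumn("rank", row_number().over(w))
  .where(col("rank") === 1)
  .select(list2("col1"), list2("col2"), list1("col3"))

result.show()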

Similarly, the solution to the second problem can be derived by just changing the join condition, as shown below:

spark.sql("""
SELECT col1,col2,col3
FROM
(SELECT
t2.col1, t2.col2, t1.col3, 
ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
FROM table2 t2
LEFT JOIN table1 t1 
ON t1.col1 = t2.col1
AND t2.col2 between t1.col2 - ${delta} and t1.col2 + ${delta} ) tmp // replace delta with actual value
WHERE tmp.rank = 1""").show()
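
Given the sizes mentioned in the question (List1 has thousands of entries while List2 has tens of millions), it may also be worth broadcasting the small table so that the join itself does not shuffle the large one. A sketch using the DataFrame broadcast hint (the window step still requires a shuffle; only the join is affected):

import org.apache.spark.sql.functions.broadcast

// hint Spark to ship the small list1 to every executor instead of shuffling list2
val joined = list2.join(broadcast(list1),
  list2("col1") === list1("col1") && list2("col2") > list1("col2"), "left")
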
– rogue-one

  • Excellent! Minor fix: RANK() should be used instead of ROW_NUMBER() in case there are identical rows in table2 – Dror B. May 23 '17 at 12:47
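
A minimal illustration of that suggested fix: only the window function in the queries above changes, everything else stays the same.

RANK() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank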