I'm new to Spark and trying to migrate an existing Python application to PySpark.
One of the first functions (in this case f(x)) needs to run for every element in the dataset, but it also has to take other elements of the dataset into account.
The best simplification I could reduce this to is the following pseudo-code:
def idx_gen_a(x):
    return x - 5

def idx_gen_b(x):
    return x * 3

def f(i, x, dataset):
    # needs access to other elements of the dataset, looked up by index
    elem1 = dataset.get(idx_gen_a(i))
    elem2 = dataset.get(idx_gen_b(i))
    ...
    return some_calculation(x, elem1, elem2, ...)

def main(dataset):
    result = []
    for i, x in enumerate(dataset):
        result.append(f(i, x, dataset))
    return result
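One workaround I can think of (just a rough sketch, not something I've fully tested) is to broadcast the entire dataset and do the index lookups inside a map over zipWithIndex. The toy data, the bounds checks and the some_calculation stub below are placeholders, and it assumes the whole dataset fits in memory on every executor, which is exactly what feels wrong about it:

from pyspark.sql import SparkSession

def idx_gen_a(i):
    return i - 5

def idx_gen_b(i):
    return i * 3

def some_calculation(x, elem1, elem2):
    # stand-in for the real computation
    return (x, elem1, elem2)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = list(range(100))            # toy stand-in for the real dataset
rdd = sc.parallelize(data)

# Ship a read-only copy of the whole dataset to every executor
# so f can look up arbitrary elements by index.
dataset_b = sc.broadcast(data)

def f(pair):
    x, i = pair                    # zipWithIndex yields (element, index)
    ds = dataset_b.value
    a, b = idx_gen_a(i), idx_gen_b(i)
    elem1 = ds[a] if 0 <= a < len(ds) else None
    elem2 = ds[b] if 0 <= b < len(ds) else None
    return some_calculation(x, elem1, elem2)

result = rdd.zipWithIndex().map(f).collect()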
Is there a Spark-ish way of doing this? foreachPartition and aggregate didn't seem to quite fit.