I'm new to Spark and trying to migrate an existing Python application to PySpark.
One of the first functions (in this case f(x)) needs to run for every element in the dataset, but it also has to take other elements of the dataset into account.
The best simplification I could reduce this to is the following pseudo-code:
def idx_gen_a(x):
    return x - 5

def idx_gen_b(x):
    return x * 3

def f(i, x, dataset):
    # needs access to other elements of the dataset, looked up by index
    elem1 = dataset.get(idx_gen_a(i))
    elem2 = dataset.get(idx_gen_b(i))
    ...
    return some_calculation(x, elem1, elem2, ...)

def main(dataset):
    result = []
    for i, x in enumerate(dataset):
        result.append(f(i, x, dataset))
    return result
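One workaround I can think of (just a rough sketch, not something I've fully tested) is to broadcast the entire dataset and do the index lookups inside a map over zipWithIndex. The toy data, the bounds checks and the some_calculation stub below are placeholders, and it assumes the whole dataset fits in memory on every executor, which is exactly what feels wrong about it:

from pyspark.sql import SparkSession

def idx_gen_a(i):
    return i - 5

def idx_gen_b(i):
    return i * 3

def some_calculation(x, elem1, elem2):
    # stand-in for the real computation
    return (x, elem1, elem2)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = list(range(100))            # toy stand-in for the real dataset
rdd = sc.parallelize(data)

# Ship a read-only copy of the whole dataset to every executor
# so f can look up arbitrary elements by index.
dataset_b = sc.broadcast(data)

def f(pair):
    x, i = pair                    # zipWithIndex yields (element, index)
    ds = dataset_b.value
    a, b = idx_gen_a(i), idx_gen_b(i)
    elem1 = ds[a] if 0 <= a < len(ds) else None
    elem2 = ds[b] if 0 <= b < len(ds) else None
    return some_calculation(x, elem1, elem2)

result = rdd.zipWithIndex().map(f).collect()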
Is there a Spark-ish way of doing this? foreachPartition and aggregate didn't seem to quite fit.