I'm reading data from three separate CSVs into Spark RDDs (using pyspark).

from pyspark import SparkContext

sc = SparkContext()
rdd_list = []
feattable = ['csv1', 'csv2', 'csv3']
for idx, ftable in enumerate(feattable):
    print("Loading data from: %s" % ftable, idx)
    thisRdd = sc.textFile('hdfs:' + ftable).mapPartitions(read_line_csv)
    thisRdd = thisRdd.map("some mapping not relevant to question")\
        .reduceByKey("some reducing not relevant to question")\
        .map("final mapping").persist()
    rdd_list.append(thisRdd)

rdd = sc.union(rdd_list).reduceByKey(lambda x, y: x + y)
print(rdd.take(5))

During the final mapping, for debugging, I emit a tuple of (user_id, [len(csv entries per user)]).
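
For concreteness, here is a minimal stand-in with hard-coded pairs in place of the three per-CSV pipelines (the values and nesting are chosen to mirror the outputs below; read_line_csv and the real mappings are left out):

a = sc.parallelize([('749003', [[2000]])])
b = sc.parallelize([('749003', [[9081]])])
c = sc.parallelize([('749003', [[100]])])
print(sc.union([a, b, c]).reduceByKey(lambda x, y: x + y).take(5))

The x + y step just concatenates the per-CSV lists, and nothing pins down the order in which they arrive.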

I run the code once and get:

Loading data from: csv1 0
Loading data from: csv2 1
Loading data from: csv3 2
[('749003', [[2000], [9081], [100]])]

Running it again, I get:

...
[('749003', [[9081], [100], [2000]])]

and again:

...
[('749003', [[9081], [2000], [100]])]

So the final list comes back in a random order from run to run. I want the final list to reflect the order of feattable. How can I force the union (or the reduce that follows it) to preserve the order in which the RDDs were appended to the list?
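
For reference, a minimal sketch of one possible workaround (not from the original post): tag each value with its feattable index before the union, then sort on that tag after the reduce. tagged_list is a hypothetical name, and the exact nesting of each value depends on what the final mapping emits:

tagged_list = [
    r.mapValues(lambda v, i=idx: [(i, v)])  # i=idx captures the loop index
    for idx, r in enumerate(rdd_list)
]

rdd = (sc.union(tagged_list)
       .reduceByKey(lambda x, y: x + y)  # still concatenates, order still arbitrary
       .mapValues(lambda vs: [v for _, v in sorted(vs, key=lambda t: t[0])]))
print(rdd.take(5))

Note that sc.union itself does keep the RDDs' partitions in order; it is the shuffle inside reduceByKey that combines values in an arbitrary order, so the ordering has to travel through the shuffle as data, which is what the index tag does here.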
