I'm reading data from three separate CSVs into Spark RDDs (using pyspark).
from pyspark import SparkContext

sc = SparkContext()
rdd_list = []
feattable = ['csv1', 'csv2', 'csv3']
for idx, ftable in enumerate(feattable):
    print("Loading data from: %s %d" % (ftable, idx))
    thisRdd = sc.textFile('hdfs:' + ftable).mapPartitions(read_line_csv)
    thisRdd = thisRdd.map("some mapping not relevant to question")\
        .reduceByKey("some reducing not relevant to question")\
        .map("final mapping").persist()
    rdd_list.append(thisRdd)
rdd = sc.union(rdd_list).reduceByKey(lambda x, y: x + y)
print(rdd.take(5))
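(read_line_csv isn't shown above; it just parses each partition's raw lines into CSV rows. A minimal sketch of what I mean, using Python's csv module:)

import csv

def read_line_csv(lines):
    # lines is an iterator over one partition's raw text lines;
    # csv.reader lazily yields each line parsed into a list of fields
    return csv.reader(lines)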
During the final mapping, for debugging, I emit a tuple of (user_id, [len(csv entries per user)]).
I run the code once and get:
Loading data from: csv1 0
Loading data from: csv2 1
Loading data from: csv3 2
[('749003', [[2000], [9081], [100]])]
Running it again I get:
...
[('749003', [[9081], [100], [2000]])]
and again:
...
[('749003', [[9081], [2000], [100]])]
So you can see I'm getting a random order in the final list. I want the final list to reflect the order of feattable. How can I force the union to preserve the order in which the RDDs were appended to the list?
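One workaround I can imagine (an untested sketch; some_value stands in for whatever my real mapping extracts) is to tag each value with its source index and sort it back out after the reduce, but I'd rather have the union itself preserve order:

# Hypothetical sketch: carry the source index with each value so the
# final list can be sorted back into feattable order after the reduce.
rdd_list = []
for idx, ftable in enumerate(feattable):
    thisRdd = sc.textFile('hdfs:' + ftable) \
        .mapPartitions(read_line_csv) \
        .map(lambda row, i=idx: (row[0], [(i, some_value(row))]))  # i=idx binds the loop index
    rdd_list.append(thisRdd)

rdd = sc.union(rdd_list) \
    .reduceByKey(lambda x, y: x + y) \
    .mapValues(lambda vals: [v for _, v in sorted(vals)])  # restore feattable order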