I'm reading data from three separate CSVs into Spark RDDs (using pyspark).

from pyspark import SparkContext

sc = SparkContext()
rdd_list = []
feattable = ['csv1', 'csv2', 'csv3']
for idx, ftable in enumerate(feattable):
    print("Loading data from: %s" % ftable, idx)
    thisRdd = sc.textFile('hdfs:' + ftable).mapPartitions(read_line_csv)
    thisRdd = thisRdd.map("some mapping not relevant to question")\
        .reduceByKey("some reducing not relevant to question")\
        .map("final mapping").persist()
    rdd_list.append(thisRdd)

rdd = sc.union(rdd_list).reduceByKey(lambda x, y: x + y)
print(rdd.take(5))

During the final mapping, for debugging, I emit a tuple of (user_id, [len(csv entries per user)]).
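
For concreteness, here is a minimal stand-in with hard-coded pairs in place of the three per-CSV pipelines (the values and nesting are chosen to mirror the outputs below; read_line_csv and the real mappings are left out):

a = sc.parallelize([('749003', [[2000]])])
b = sc.parallelize([('749003', [[9081]])])
c = sc.parallelize([('749003', [[100]])])
print(sc.union([a, b, c]).reduceByKey(lambda x, y: x + y).take(5))

The x + y step just concatenates the per-CSV lists, and nothing pins down the order in which they arrive.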

I run the code once and get:

Loading data from: csv1 0
Loading data from: csv2 1
Loading data from: csv3 2
[('749003', [[2000], [9081], [100]])]

Running it again, I get:

...
[('749003', [[9081], [100], [2000]])]

and again:

...
[('749003', [[9081], [2000], [100]])]

So the final list comes back in a random order from run to run. I want the final list to reflect the order of feattable. How can I force the union (or the reduce that follows it) to preserve the order in which the RDDs were appended to the list?
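
For reference, a minimal sketch of one possible workaround (not from the original post): tag each value with its feattable index before the union, then sort on that tag after the reduce. tagged_list is a hypothetical name, and the exact nesting of each value depends on what the final mapping emits:

tagged_list = [
    r.mapValues(lambda v, i=idx: [(i, v)])  # i=idx captures the loop index
    for idx, r in enumerate(rdd_list)
]

rdd = (sc.union(tagged_list)
       .reduceByKey(lambda x, y: x + y)  # still concatenates, order still arbitrary
       .mapValues(lambda vs: [v for _, v in sorted(vs, key=lambda t: t[0])]))
print(rdd.take(5))

Note that sc.union itself does keep the RDDs' partitions in order; it is the shuffle inside reduceByKey that combines values in an arbitrary order, so the ordering has to travel through the shuffle as data, which is what the index tag does here.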
