I have a large number of parquet files in a directory that represent different tables of the same data schema, and I want to merge them into one big RDD. Ideally, I would like to do a map-reduce where the mapper emits small RDDs and the reducer merges them. However, I couldn't figure out how to emit RDDs within a mapper. Any ideas?
The code below builds the sorted list of files in the directory, and the last line should generate the complete RDD from it. However, it gives an "unable to serialize" error, since I don't think you can create an RDD inside a map task.
# Split each file name like "a-b-c.parquet" into its parts, then build
# [full path, zero-padded key 1, zero-padded key 2] entries and sort by the first key
parts = map(lambda f: f.name.split('.')[0].split('-'), dbutils.fs.ls('/mnt/s3/rds/27jul2017-parquet/'))
arr = map(lambda x: ["/mnt/s3/rds/27jul2017-parquet/%s-%s-%s.parquet" % (x[0], x[1], x[2]), x[1].zfill(10), x[2].zfill(10)], parts)
result = sorted(arr, key=lambda x: x[1])
# Fails with the serialization error: spark.read.parquet cannot be called inside
# an executor task, because the SparkSession only exists on the driver
sc.parallelize(result).map(lambda x: (1, spark.read.parquet(x[0]))).reduceByKey(lambda x, y: x.unionAll(y))
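For reference, the closest I can get is to keep the reads and unions on the driver, which works but does the merge step serially, which is exactly what I was hoping to parallelize. This is only a sketch using the result list built above (dfs and merged are placeholder names, and unionAll is superseded by union in Spark 2.x):

from functools import reduce

# Read each parquet file into its own DataFrame on the driver, then fold them
# together; the scans still run on the executors, only the union calls run here
dfs = [spark.read.parquet(path) for path, _, _ in result]
merged = reduce(lambda a, b: a.unionAll(b), dfs)

Since all the files share one schema, another option would be to pass every path to a single read, e.g. spark.read.parquet(*[r[0] for r in result]), and let Spark plan the scan itself.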