pyspark - multiple input files into one RDD and one output file

Question

I have a wordcount in Python that I want to run on Spark with multiple text files and get ONE output file, so the words are counted in all files altogether. I tried a few solutions for example the ones found here and here, but it still gives the same number of output files as the number of input files.

rdd = sc.textFile("file:///path/*.txt")
input = sc.textFile(join(rdd))

or

rdd = sc.textFile("file:///path/f0.txt,file:///path/f1.txt,...")
rdds = Seq(rdd)
input = sc.textFile(','.join(rdds))

or

rdd = sc.textFile("file:///path/*.txt")
input = sc.union(rdd)

don't work. Can anybody suggest a solution how to make one RDD of a few input text files?

Thanks in advance...

It means the same as a few lines above - the program returns the same number of output files as the number of input files. — piterd, Feb 24 '16 at 17:03

Mohitt · Accepted Answer · 2016-02-24T19:20:26.400

9

This should load all the files matching the pattern.

rdd = sc.textFile("file:///path/*.txt")

Now, you do not need to do any union. You have only one RDD.

Coming to your question - why are you getting many output files. The number of output files depends on number of partitions in the RDD. When you run word count logic, your resultant RDD can have more than 1 partitions. If you want to save the RDD as single file, use coalesce or repartition to have only one partition.

The code below works, taken from Examples.

rdd = sc.textFile("file:///path/*.txt")
counts = rdd.flatMap(lambda line: line.split(" ")) \
...              .map(lambda word: (word, 1)) \
...              .reduceByKey(lambda a, b: a + b)

counts.coalesce(1).saveAsTextFile("res.csv")

edited Feb 24 '16 at 19:20

answered Feb 24 '16 at 17:36

Mohitt

2,957
3
29
52

Does it mean that all the actions and operations run on several partitions and only the output is merged into one thing? For example the word 'the' is counted separately for each partition (file) and there will be several values for the same word in the output file? If yes - perhaps `coalesce()` could be used as first or somewhere at the beginning?... – piterd Feb 24 '16 at 18:01
No, whenever u perform action/transformation you get new RDD. In above-mentioned example, counts Rdd is resultant RDD. A word 'the' will appear only once in counts RDD. – Mohitt Feb 24 '16 at 18:07
You can checkout this presentation for more details - https://www.slideshare.net/mobile/mohitgargk/spark-101-58162162 – Mohitt Feb 24 '16 at 18:08
Your solution works perfectly @Mohitt. Thank you for that and for the presentation. – piterd Feb 24 '16 at 19:18

pyspark - multiple input files into one RDD and one output file

1 Answers1