I have a wordcount in Python that I want to run on Spark with multiple text files and get ONE output file, so that the words are counted across all of the files combined. I tried a few solutions, for example the ones found here and here, but I still get the same number of output files as the number of input files.
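For reference, the wordcount itself is the usual PySpark version; roughly the following, with placeholder paths (a minimal sketch, not my exact script):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# read every .txt file under the input directory
lines = sc.textFile("file:///path/*.txt")

# classic wordcount: split lines into words, pair each word with 1, sum per word
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# saveAsTextFile writes one part-* file per partition, which seems to be
# where the multiple output files come from
counts.saveAsTextFile("file:///path/output")

These are the variations I tried for reading the inputs into one RDD: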
rdd = sc.textFile("file:///path/*.txt")
input = sc.textFile(join(rdd))
or
rdd = sc.textFile("file:///path/f0.txt,file:///path/f1.txt,...")
rdds = Seq(rdd)
input = sc.textFile(','.join(rdds))
or
rdd = sc.textFile("file:///path/*.txt")
input = sc.union(rdd)
None of these work. Can anybody suggest how to make one RDD out of a few input text files?
Thanks in advance...