I believe this question deserves a more elaborate answer. Let's start with this piece of code:
rdd.reduce(lambda x, y: "\n".join([x, y]))
Contrary to what you may think, it doesn't guarantee that the values are merged in any specific order. If you, for example, port it to Scala, you are likely to get a completely mixed-up result.
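To make this concrete, here is a minimal sketch (with made-up data) of why the pattern is fragile: reduce only promises a well-defined result for commutative and associative functions, and string concatenation is not commutative, so the line order of the output is not part of the contract.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Four items spread across four partitions (hypothetical sample data).
lines = sc.parallelize(["a", "b", "c", "d"], 4)
joined = lines.reduce(lambda x, y: "\n".join([x, y]))
# 'joined' contains all four items, but the order in which they were
# concatenated is an implementation detail, not a guarantee.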
Next, there is no point in having an RDD with a single item. If you do:
- Data is not distributed - you are no better off than with a plain local object.
- Consequently, processing is not truly parallelized.
So if you have a single item and want to apply more transformations to it, just use plain Python objects.
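A minimal sketch of that approach, assuming the goal is further string processing of a single record (rdd being the single-item RDD discussed above):

# Pull the single value back to the driver and continue with plain Python.
record = rdd.first()            # or rdd.collect()[0]
blocks = record.split("\n\n")   # ordinary Python string processing from here on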
Is wholeTextFiles any better? It is not. With a single file it suffers from the same problems as keeping a local object:
- With a single file, all data goes to a single partition.
- Processing is not distributed.
- Data is eagerly loaded and not distributed, so when the size of the input grows, you may expect executor failures.
Finally, the wholeTextFiles implementation is fairly inefficient, so the overall memory footprint in PySpark can be a few times larger than the size of the data.
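For reference, a sketch of what wholeTextFiles gives you here (the path is hypothetical): one (filename, content) record per file, with the whole file eagerly loaded as a single string.

pairs = sc.wholeTextFiles('/tmp/asd.txt')
pairs.count()                     # 1 - a single record holding the entire file
content = pairs.values().first()  # the full file content as one string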
You didn't provide enough context, but I'll make an educated guess and assume you want to separate blocks of data. If I am right, you should use a custom record delimiter (see creating spark data structure from multiline record):
rdd = sc.newAPIHadoopFile(
    '/tmp/asd.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n'}
).values()
which will split your data like this:
rdd.take(3)
# ["hello\ni'm Arya\ni'm 21 yrold", "Hello\ni'm Jack\ni'm 30.", 'i am ali.']