From the PySpark docs for wholeTextFiles():
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
So your code:
files = sc.wholeTextFiles("file:///data/*/*/")
creates an RDD which contains records of the form:
(file_name, file_contents)
Getting the contents of the files is then just a simple map operation to get the second element of this tuple:
message = files.map(lambda x: x[1])
message is now another RDD that contains only the file contents.
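To make the record shape concrete without a Spark cluster, here is a plain-Python analogue. It is only a sketch and does not use Spark: whole_text_files_local is a hypothetical helper, and the temporary directory and file names are made up for illustration. It builds (path, content) pairs the way the docs describe and then applies the same x[1] mapping.

```python
import glob
import os
import tempfile


def whole_text_files_local(pattern):
    """Hypothetical stand-in for wholeTextFiles(): one (path, content) pair per file."""
    records = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as fh:
                records.append((path, fh.read()))
    return records


# Build a small throwaway tree: <root>/a/one.txt and <root>/b/two.txt (made-up names).
root = tempfile.mkdtemp()
samples = [("a", "one.txt", "first\nfile\n"), ("b", "two.txt", "second file\n")]
for sub, name, text in samples:
    os.makedirs(os.path.join(root, sub), exist_ok=True)
    with open(os.path.join(root, sub, name), "w", encoding="utf-8") as fh:
        fh.write(text)

# Each record is (file_path, file_contents); the whole file is a single value.
files = whole_text_files_local(os.path.join(root, "*", "*"))

# Same idea as files.map(lambda x: x[1]) on the RDD: keep only the contents.
message = list(map(lambda x: x[1], files))
print(message)
```

Note that each element of message is the full text of one file, newlines included, not one line per element. On a real pair RDD, files.values() should give the same result as the map above.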
More relevant information about wholeTextFiles() and how it differs from textFile() can be found at this post.