How do I grab a value from an RDD in PySpark?

Question

I have this code:

files = sc.wholeTextFiles("file:///data/*/*/")

So, when I run the above command, I get this:

[('file:/data/file.txt',  'Message')]

How do I grab the 'Message' part, and not the file name, from this RDD in PySpark?

I tried this:

val message = files.map(x => x._2)

but it does not work.

The code you tried looks like scala, but you're asking about python. Direct translation of your code would be `message = files.map(lambda x, x[1])` but this seems like an XY problem. What is it that you're trying to do? — pault, Feb 17 '18 at 22:44
True, that looks like scala, but trying to get second tuple. I don't need the file name, but just the message. How would I write the scala code in pyspark? — Steve McAffer, Feb 17 '18 at 22:46
I get a "SyntaxError: invalid syntax" and it points to the first [ in the line. Can you help? — Steve McAffer, Feb 17 '18 at 22:57
message = files.map(lambda x: x[1]). This worked! Thanks for your help! — Steve McAffer, Feb 17 '18 at 23:00

score 1 · Accepted Answer · answered Feb 17 '18 at 23:34

This is how you would do it in Scala:

val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
// collect() brings the (path, contents) pairs to the driver; _2 is the file contents
rdd.collect().foreach(t => println(t._2))

Ruzbeh Irani

score 1 · Answer 2 · answered Feb 18 '18 at 03:18

From the PySpark docs for wholeTextFiles():

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

So your code:

files = sc.wholeTextFiles("file:///data/*/*/")

creates an RDD whose records have the form:

(file_name,  file_contents)

Getting the contents of the files is then just a simple map operation to get the second element of this tuple:

message = files.map(lambda x: x[1])

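message is now another RDD that contains only the file contents. Like any RDD transformation, map is lazy, so nothing is read until you call an action such as message.collect().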
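To make the indexing concrete without needing a cluster, here is a minimal sketch in plain Python. The `records` list is a stand-in for the collected output shown in the question, and `get_contents` is a hypothetical helper name, not part of the PySpark API. The commented lines show how the same function would be used on the real RDD, assuming `files` is the RDD from the question.

```python
from operator import itemgetter

# Stand-in for files.collect(): the (path, contents) pairs from the question.
records = [("file:/data/file.txt", "Message")]

# itemgetter(1) behaves like `lambda x: x[1]`: it returns the second element.
get_contents = itemgetter(1)

messages = list(map(get_contents, records))
print(messages)  # ['Message']

# With a live SparkContext, the same idea applies to the RDD itself:
#   message = files.map(get_contents)   # RDD holding only the file contents
#   message = files.values()            # equivalent shorthand for key-value RDDs
#   message.collect()                   # returns the contents as a Python list
```

For key-value RDDs like the one wholeTextFiles() returns, values() reads a little more clearly than an index-based map, though both give the same result.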

More relevant information about wholeTextFiles() and how it differs from textFile() can be found at this post.
