I have this code:

files = sc.wholeTextFiles("file:///data/*/*/")

So, when I run the above command, I get this:

[('file:/data/file.txt',  'Message')]

How do I grab the 'Message' part and not the file name from this RDD in pyspark?

I tried this:

val message = files.map(x => x._2)

but it does not work.

Steve McAffer
  • The code you tried looks like Scala, but you're asking about Python. A direct translation of your code would be `message = files.map(lambda x, x[1])` but this seems like an XY problem. What is it that you're trying to do? – pault Feb 17 '18 at 22:44
  • True, that looks like Scala, but I'm trying to get the second element of the tuple. I don't need the file name, just the message. How would I write the Scala code in PySpark? – Steve McAffer Feb 17 '18 at 22:46
  • I edited my original comment to add the Python code. – pault Feb 17 '18 at 22:48
  • I get a "SyntaxError: invalid syntax" and it points to the first [ in the line. Can you help? – Steve McAffer Feb 17 '18 at 22:57
  • message = files.map(lambda x: x[1]). This worked! Thanks for your help! – Steve McAffer Feb 17 '18 at 23:00

2 Answers

This is how you would do it in Scala:

// Each record is a (path, contents) pair; _2 is the file contents.
val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t => println(t._2))
Ruzbeh Irani

From the pyspark docs, wholeTextFiles():

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

So your code:

files = sc.wholeTextFiles("file:///data/*/*/")

creates an RDD which contains records of the form:

(file_name,  file_contents)

Getting the contents of the files is then just a simple map operation to get the second element of this tuple:

message = files.map(lambda x: x[1])

message is now another RDD that contains only the file contents.
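
To see what that map does without a running cluster, here is a minimal sketch that applies the same "take the second element" step to a plain Python list. The list, the paths, the contents, and the helper name second_element are all made up for illustration; a real RDD is distributed and lazy, so this only mirrors the shape of the operation.

```python
# Plain-Python stand-in for the (path, contents) pairs that wholeTextFiles() yields.
# The paths and contents below are made-up sample data, not real files.
records = [
    ("file:/data/a/file1.txt", "Message one"),
    ("file:/data/b/file2.txt", "Message two"),
]


def second_element(pair):
    """Return the value (file contents) from a (path, contents) pair."""
    return pair[1]


# Same shape as files.map(lambda x: x[1]), applied to a local list, not an RDD.
messages = list(map(second_element, records))

# Tuple unpacking is an equivalent spelling of the same selection.
messages_unpacked = [contents for _path, contents in records]

print(messages)
```

On an actual pair RDD, files.values() should give the same result as the map above, if you prefer not to write the lambda.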

More relevant information about wholeTextFiles() and how it differs from textFile() can be found at this post.

pault