I have a Spark RDD where each element is a tuple of the form (key, input). I would like to use the pipe method to pass each input to an external executable and generate a new RDD of the form (key, output). I need the keys to correlate the results later.
Here is an example using the spark-shell:
val data = sc.parallelize(Seq(
  ("file1", "one"),
  ("file2", "two two"),
  ("file3", "three three three")))
// Incorrectly processes the data (calls toString() on each tuple)
data.pipe("wc")
// Loses the keys, generates extraneous results
data.map(_._2).pipe("wc")
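The closest workaround I have come up with (a sketch only, untested at scale) is to bypass pipe entirely and invoke the command via scala.sys.process inside a map, so each key stays paired with its output. This assumes the executable reads stdin and writes stdout, and it spawns one process per record, which is likely too expensive for a large RDD:

```scala
import java.io.ByteArrayInputStream
import scala.sys.process._

// Sketch: run "wc" once per record, keeping the key alongside the output.
// Assumes the external command consumes stdin and prints its result to stdout.
val result = data.map { case (key, input) =>
  val stdin  = new ByteArrayInputStream(input.getBytes("UTF-8"))
  val output = ("wc" #< stdin).!!  // capture the command's stdout as a String
  (key, output.trim)
}
```

Is there a way to get the same (key, output) pairing using pipe itself, without paying the per-record process-launch cost?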
Thanks in advance.