
I'm running a pyspark job on spark (single node, stand-alone) and trying to save the output in a text file in the local file system.

input = sc.textFile(inputfilepath)
words = input.flatMap(lambda x: x.split())
wordCount = words.countByValue()

wordCount.saveAsTextFile("file:///home/username/output.txt")

I get an error saying

AttributeError: 'collections.defaultdict' object has no attribute 'saveAsTextFile'

Basically, whatever method I call on the 'wordCount' object, for example collect() or map(), returns the same error. The code works with no problem when the output goes to the terminal (with a for loop), but I can't figure out what is missing to send the output to a file.

piterd

1 Answer


The countByValue() method you're calling returns a dictionary of word counts. This is just a standard Python dictionary (a collections.defaultdict), not an RDD, so it doesn't have any Spark methods such as saveAsTextFile available on it.

You can use your favorite method to save the dictionary locally.
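For example, a minimal sketch using plain file I/O (the dict literal below just stands in for countByValue()'s result, and the output path is arbitrary):

```python
# Hypothetical stand-in for the result of words.countByValue()
word_count = {"spark": 3, "hello": 2, "world": 1}

# Write one "word<TAB>count" line per entry to a local file
with open("/tmp/output.txt", "w") as f:
    for word, count in sorted(word_count.items()):
        f.write(f"{word}\t{count}\n")
```

Since the counts already live in the driver's memory at this point, any ordinary Python serialization (json, csv, pickle) works just as well.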

Kyle Heuton
  • Beat me to it. @Snoozer is 100% correct. countByValue doesn't create a new RDD, it's a local dictionary. – Joe Widen Feb 19 '16 at 19:34
  • Thanks... I changed it to `map(lambda x: (str(x),1)).reduceByKey(add)` with `from operator import add` – piterd Feb 19 '16 at 20:20
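The pipeline piterd settles on keeps the data as a pair RDD, which does support saveAsTextFile. As a rough illustration, here is what map/reduceByKey compute, emulated in plain Python (Spark itself not required; the sample lines are hypothetical):

```python
from functools import reduce
from itertools import groupby
from operator import add

# Hypothetical sample standing in for the RDD's contents
lines = ["hello world", "hello spark"]

# flatMap(split) then map(lambda x: (x, 1)): one (word, 1) pair per word
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey(add): group pairs by word, then sum the 1s per group
counts = {
    word: reduce(add, (count for _, count in group))
    for word, group in groupby(sorted(pairs), key=lambda p: p[0])
}
```

In Spark, the equivalent `words.map(lambda x: (x, 1)).reduceByKey(add)` yields an RDD of (word, count) pairs, so calling saveAsTextFile on it works as expected.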