I am trying to count bigrams with Spark, using the Python API (PySpark).
The output is strange. The saved file contains multiple lines like this:
<generator object <genexpr> at 0x11aab40>
This is my code:
from pyspark import SparkConf, SparkContext
import string
conf = SparkConf().setMaster('local').setAppName('BigramCount')
sc = SparkContext(conf = conf)
RDDvar = sc.textFile("file:///home/cloudera/Desktop/smallTest.txt")
sentences = RDDvar.flatMap(lambda line: line.split("."))
words = sentences.flatMap(lambda line: line.split(" "))
bigrams = words.flatMap(lambda x:[((x[i],x[i+1]) for i in range(0,len(x)-1))])
result = bigrams.map(lambda bigram: bigram, 1)
aggreg1 = result.reduceByKey(lambda a, b: a+b)
result.saveAsTextFile("file:///home/cloudera/bigram_out")
What is going wrong?
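For reference, here is a minimal sketch of what a corrected pipeline might look like. It is not a definitive fix, just one way to address the issues visible in the code above: the generator expression wrapped in a list makes `flatMap` emit a single generator object per element (hence the output shown); `words` is already a flat RDD of single words, so `x[i]` indexes characters rather than words; `map(lambda bigram: bigram, 1)` passes `1` as the second argument of `map` instead of building a `(bigram, 1)` tuple; and `result` is saved instead of the aggregated `aggreg1`. The helper names `sentence_bigrams`, `count_bigrams`, and `main` are hypothetical, and splitting on any whitespace is an assumption about the input.

```python
def sentence_bigrams(sentence):
    """Return the list of adjacent word pairs in one sentence."""
    # Assumption: words are separated by whitespace; split() drops empty strings.
    words = sentence.split()
    # zip pairs each word with its successor; list() materializes the pairs
    # so Spark receives tuples rather than a lazy generator object.
    return list(zip(words, words[1:]))


def count_bigrams(sc, input_path, output_path):
    """Count bigrams in a text file and save (bigram, count) pairs."""
    lines = sc.textFile(input_path)
    sentences = lines.flatMap(lambda line: line.split("."))
    # Build bigrams from whole sentences, not from individual words.
    bigrams = sentences.flatMap(sentence_bigrams)
    # The parentheses matter: each record becomes a (key, 1) tuple.
    pairs = bigrams.map(lambda bigram: (bigram, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    # Save the aggregated RDD, not the intermediate one.
    counts.saveAsTextFile(output_path)
    return counts


def main():
    # Imported here so the helpers above can be tried without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("BigramCount")
    sc = SparkContext(conf=conf)
    count_bigrams(
        sc,
        "file:///home/cloudera/Desktop/smallTest.txt",
        "file:///home/cloudera/bigram_out",
    )
    sc.stop()
```

Calling `main()` would run the job against the same paths as in the question. Since `textFile` reads line by line, a sentence that spans two lines would still be split at the line break in this sketch.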