I'm new to Spark and trying to implement tf-idf. I need to calculate how many times each word occurs in each document, and the total number of words in each document. I want to use reduce and possibly another operation, but I don't know how yet. Here's the input I have:
Pairs are of the form (documentName, (word, wordCount)), for example:
("doc1", ("a", 3)), ("doc1", ("the", 2)), ("doc2", ("a", 5)),
("doc2",("have", 5))
Keys are documents and values are words together with how many times that word occurs in that document. I want to count the total words in every document, and possibly calculate each word's percentage of that total.
Output I want:
("doc1", (("a", 3), 5)) , ("doc1", (("the", 2), 5)),
("doc2", (("a", 5),10)), ("doc2", (("have", 5),10))
I get the effect by

corpus.join(corpus.mapValues(lambda wc: wc[1]).reduceByKey(lambda a, b: a + b))

(summing over mapValues first, since reducing the raw (word, count) tuples with x[1] + y[1] stops working once a document has more than two distinct words: after the first merge, x is already a plain count).
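With corpus as built in the starting point below, that join yields the output above, and the percentage I mentioned should then be just one more mapValues. A rough sketch (with_total and frequencies are my own names):

with_total = corpus.join(corpus.mapValues(lambda wc: wc[1])
                               .reduceByKey(lambda a, b: a + b))

# Each value is ((word, count), total); divide to get the word's share of the document.
frequencies = with_total.mapValues(lambda v: (v[0][0], v[0][1] / v[1]))
# e.g. ("doc1", ("a", 0.6)), ("doc2", ("have", 0.5)), ...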
Starting point:

from operator import add
from pyspark import SparkContext

collect_of_docs = [(doc1, text1), (doc2, text2), ....]

def count_words(x):
    # x is a (documentName, text) pair; emit ((word, documentName), 1)
    # for every word occurrence so reduceByKey can sum the counts.
    pairs = []
    for w in x[1].split():
        pairs.append(((w, x[0]), 1))
    return pairs

sc = SparkContext()
corpus = sc.parallelize(collect_of_docs)
input = (corpus
         .flatMap(count_words)
         .reduceByKey(add)                                # ((word, doc), count)
         .map(lambda kv: (kv[0][1], (kv[0][0], kv[1]))))  # (doc, (word, count))
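For reference, with made-up texts this pipeline gives exactly the corpus pairs shown at the top (the texts here are my own toy data):

collect_of_docs = [("doc1", "a a a the the"),
                   ("doc2", "a a a a a have have have have have")]

print(sorted(input.collect()))
# [('doc1', ('a', 3)), ('doc1', ('the', 2)),
#  ('doc2', ('a', 5)), ('doc2', ('have', 5))]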
If possible, I'd like to do this with only one reduce operation, maybe with a tricky operator. Any help is appreciated :) Thanks in advance.
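For what it's worth, the closest I've gotten to a single shuffle is the combineByKey sketch below (to_acc, add_pair, merge_accs, and with_total are just my names). It drags a running total along with the collected (word, count) pairs, but I don't know whether it's actually better than the reduceByKey + join:

def to_acc(wc):
    # First (word, count) pair seen for a document.
    return ([wc], wc[1])

def add_pair(acc, wc):
    # Fold another (word, count) pair into the accumulator.
    return (acc[0] + [wc], acc[1] + wc[1])

def merge_accs(a, b):
    # Merge partial accumulators coming from different partitions.
    return (a[0] + b[0], a[1] + b[1])

with_total = (corpus
              .combineByKey(to_acc, add_pair, merge_accs)
              .flatMap(lambda kv: [(kv[0], (wc, kv[1][1]))
                                   for wc in kv[1][0]]))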