I am new to Spark 2. I tried the Spark TF-IDF example:
from pyspark.ml.feature import HashingTF, Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)
It outputs:
Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
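To look at the vector's contents more directly, I also printed its fields (a minimal sketch; indices and values are attributes of pyspark.ml.linalg.SparseVector):

# inspect the sparse vector directly
row = featurizedData.first()
vec = row.rawFeatures
print(vec.indices)  # [ 1 13 24]
print(vec.values)   # [ 3.  1.  1.]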
I expected that in rawFeatures I would get term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:

tf(w) = (number of times the word appears in a document) / (total number of words in the document)

In our case that is tf(w) = 1/5 = 0.2 for each word, because each word appears exactly once in the document.
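For reference, this is how I computed the frequencies I expected (plain Python on the tokenized words, not Spark):

from collections import Counter

words = ["hi", "i", "heard", "about", "spark"]
counts = Counter(words)

# relative term frequency: occurrences / total number of words
tf = {w: c / float(len(words)) for w, c in counts.items()}
print(tf)  # each word appears once, so every tf is 1/5 = 0.2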
If we imagine that the output rawFeatures dictionary contains the word index as the key and the number of times the word appears in the document as the value, why is key 1 equal to 3.0? There is no word that appears three times in the document.
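Could it be that several words hash to the same slot, since HashingTF maps each word into one of numFeatures=32 buckets? A toy sketch of that idea (md5 here is just a hypothetical stand-in; Spark's actual hash function differs, so these bucket numbers will not match HashingTF's):

import hashlib

def bucket(word, num_features=32):
    # stand-in hash: maps a word to one of num_features buckets
    # (Spark's HashingTF uses its own hash, so real buckets differ)
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

words = ["hi", "i", "heard", "about", "spark"]
for w in words:
    print(w, bucket(w))
# if two or more words land in the same bucket, their counts would be
# summed into a single entry of the sparse vector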
This is confusing to me. What am I missing?