
I am new to Spark 2. I tried the Spark TF-IDF example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.getOrCreate()

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

# Split the sentence into lowercase words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# Hash each word into one of 32 buckets and count occurrences per bucket
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)

It outputs:

Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))

I expected that in rawFeatures I would get term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:

tf(w) = (Number of times the word appears in a document) / (Total number of words in the document)

In our case that is tf(w) = 1/5 = 0.2 for each word, because each word appears once in the document. If the rawFeatures dictionary maps a word index (key) to the number of times that word appears in the document (value), why is key 1 equal to 3.0? There is no word that appears in the document 3 times. This is confusing to me. What am I missing?
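
To illustrate, this is roughly the computation I expected (plain Python done by hand, not actual Spark code):

words = ["hi", "i", "heard", "about", "spark"]
tf = {w: words.count(w) / len(words) for w in words}
print(tf)  # {'hi': 0.2, 'i': 0.2, 'heard': 0.2, 'about': 0.2, 'spark': 0.2}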

1 Answer

TL;DR: It is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket, and with a very low number of buckets like here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.
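
For example, here is a minimal sketch that reuses the wordsData DataFrame from your question. CountVectorizer builds an explicit vocabulary instead of hashing, so every distinct word gets its own index and collisions cannot happen:

from pyspark.ml.feature import CountVectorizer

# Fit a vocabulary over the corpus; each distinct word gets a unique index
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
cvModel = cv.fit(wordsData)
cvModel.transform(wordsData).select("rawFeatures").show(truncate=False)
# Each of the 5 distinct words gets its own index with count 1.0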

In detail: HashingTF uses MurmurHash 3 by default. [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source you'll see that the implementation is equivalent to:

import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 =>
    hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))
// Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)

When you take the non-negative modulo of these values you'll get [24, 1, 13, 1, 1]:

// Equivalent of Spark's (private) Utils.nonNegativeMod:
def nonNegativeMod(x: Int, mod: Int): Int = ((x % mod) + mod) % mod

List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))
// List[Int] = List(24, 1, 13, 1, 1)
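
You can double-check the bucketing in plain Python too, since Python's % operator already returns a non-negative result for a positive modulus:

hashes = [-537608040, -1265344671, 266149357, 146891777, 2101843105]
print([h % 32 for h in hashes])  # [24, 1, 13, 1, 1]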

Three words from the list (i, about and spark) hash to the same bucket (index 1), and each occurs once, hence the 3.0 at key 1 in your output.
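
For comparison, keeping HashingTF but leaving numFeatures at its default of 2**18 makes a collision among five words very unlikely. A quick sketch, again reusing wordsData from the question:

from pyspark.ml.feature import HashingTF

# Default numFeatures is 2**18 = 262144, so these five hashes almost
# certainly land in distinct buckets
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
hashingTF.transform(wordsData).select("rawFeatures").show(truncate=False)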
