
I have computed term frequencies using HashingTF in Spark, calling tf.transform to get the frequency of each word.

But the results are shown in this format:

[<hashIndexOfHashBucketOfWord1>, <hashIndexOfHashBucketOfWord2>, ...], [termFrequencyOfWord1, termFrequencyOfWord2, ...]

For example:

(1048576,[105,3116],[1.0,2.0])

I can get the index of a word's hash bucket using tf.indexOf("word").

But how can I get the word back from the index?

mrry
Srini

1 Answer


Well, you can't. Since hashing is non-injective, there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is impossible to tell which one is actually there.
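To make the point concrete, here is a minimal sketch in plain Python. It assumes a stand-in hash (CRC32 modulo the bucket count), which is not the function Spark uses; the helper name stand_in_bucket is hypothetical. With more distinct tokens than buckets, at least two tokens must share a bucket, so the index alone cannot identify the token.

```python
import zlib
from collections import defaultdict


def stand_in_bucket(term, num_buckets):
    # Hypothetical stand-in for a hashing-trick index; NOT Spark's hash function.
    return zlib.crc32(term.encode("utf-8")) % num_buckets


num_buckets = 8
# One more distinct token than there are buckets, so a collision is unavoidable.
tokens = ["token%d" % i for i in range(num_buckets + 1)]

buckets = defaultdict(list)
for token in tokens:
    buckets[stand_in_bucket(token, num_buckets)].append(token)

# Buckets that hold more than one token: the index cannot be inverted for these.
shared = {bucket: members for bucket, members in buckets.items() if len(members) > 1}
print(shared)
```

The same argument applies to any bucket count, since the set of possible tokens is unbounded while the number of buckets is fixed.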

If you're using a large number of hash buckets and the number of unique tokens is relatively low, you can try to build a lookup table from each bucket to the tokens in your dataset that map to it. This is a one-to-many mapping, but if the above conditions are met the number of collisions should be relatively low.
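A rough sketch of such a lookup table in plain Python follows. The function names are hypothetical, and stand_in_index_of is only a placeholder (CRC32 modulo the bucket count); with Spark you would presumably pass the question's tf.indexOf as index_of instead, so that the indices match what HashingTF produced.

```python
import zlib
from collections import defaultdict

NUM_FEATURES = 1 << 20  # 1048576, the vector size shown in the question


def stand_in_index_of(term, num_features=NUM_FEATURES):
    # Placeholder for tf.indexOf(term); NOT Spark's actual hash function.
    return zlib.crc32(term.encode("utf-8")) % num_features


def build_bucket_lookup(documents, index_of=stand_in_index_of):
    # Map each bucket index to the set of dataset tokens that land in it.
    lookup = defaultdict(set)
    for tokens in documents:
        for token in tokens:
            lookup[index_of(token)].add(token)
    return dict(lookup)


documents = [["foo", "bar"], ["foo", "foobar", "baz"]]
lookup = build_bucket_lookup(documents)

# Buckets with more than one candidate token are ambiguous.
collisions = {index: sorted(tokens) for index, tokens in lookup.items() if len(tokens) > 1}
print(lookup)
print(collisions)
```

Any bucket that appears in collisions cannot be decoded to a single word; for the rest, the table gives the only token from your dataset that hashes there.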

If you need a reversible transformation, you can combine Tokenizer and StringIndexer and build a sparse feature vector manually.
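The idea behind that approach can be sketched in plain Python, without the Spark API: give every distinct token its own index, keep that mapping, and count occurrences per index. The helpers build_vocabulary and to_sparse are hypothetical names, and ordering the vocabulary by frequency with alphabetical tie-breaking is just one deterministic choice assumed here.

```python
from collections import Counter


def build_vocabulary(documents):
    # One index per distinct token: most frequent first, ties broken alphabetically.
    counts = Counter(token for tokens in documents for token in tokens)
    return sorted(counts, key=lambda token: (-counts[token], token))


def to_sparse(tokens, token_to_index, size):
    # Return (size, indices, values), mirroring the sparse layout in the question.
    counts = Counter(token_to_index[t] for t in tokens if t in token_to_index)
    indices = sorted(counts)
    return (size, indices, [float(counts[i]) for i in indices])


documents = [["foo", "bar"], ["foo", "foobar", "baz"]]
vocabulary = build_vocabulary(documents)
token_to_index = {token: index for index, token in enumerate(vocabulary)}
vectors = [to_sparse(tokens, token_to_index, len(vocabulary)) for tokens in documents]

print(vocabulary)
print(vectors)
```

Because every index belongs to exactly one token, vocabulary[i] recovers the word for index i, which is what hashing cannot offer.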

See also: What hashing function does Spark use for HashingTF and how do I duplicate it?

Edit:

In Spark 1.5+ (PySpark 1.6+) you can use CountVectorizer, which applies a reversible transformation and stores the vocabulary.

Python:

from pyspark.ml.feature import CountVectorizer

df = sc.parallelize([
    (1, ["foo", "bar"]), (2, ["foo", "foobar", "baz"])
]).toDF(["id", "tokens"])

vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(df)
vectorizer.vocabulary
## ('foo', 'baz', 'bar', 'foobar')

Scala:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = sc.parallelize(Seq(
    (1, Seq("foo", "bar")), (2, Seq("foo", "foobar", "baz"))
)).toDF("id", "tokens")

val model: CountVectorizerModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(df)

model.vocabulary
// Array[String] = Array(foo, baz, bar, foobar)

where the element at position 0 corresponds to index 0, the element at position 1 to index 1, and so on.
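Given that correspondence, mapping a feature vector back to words is a plain list lookup. The sketch below uses ordinary Python lists in place of a Spark vector, and the helper name decode is hypothetical; with PySpark you would presumably read the indices and values attributes of the SparseVector in the features column. The vocabulary order is copied from the output above.

```python
def decode(vocabulary, indices, values):
    # Pair each feature index with the word stored at that position.
    return {vocabulary[i]: v for i, v in zip(indices, values)}


# Vocabulary in the order shown above.
vocabulary = ["foo", "baz", "bar", "foobar"]

# Indices and counts for the tokens ["foo", "foobar", "baz"] under that vocabulary.
indices = [0, 1, 3]
values = [1.0, 1.0, 1.0]

decoded = decode(vocabulary, indices, values)
print(decoded)
```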

Community
zero323
I would just like to add that, as you can see in the [docs](https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html), you have been able to call indexOf(term) since 1.2.0. – Matt Apr 06 '16 at 18:00