
Apache Spark MLlib has a HashingTF() function that takes sets of tokenized words as input and converts them into fixed-length feature vectors.

As mentioned in the Spark MLlib documentation, it is advisable to use a power of two as the feature dimension.

My question is whether the exponent should be the number of terms in the input.

If so, suppose I take more than 1000 text documents as input containing more than 5000 distinct terms; would the feature dimension then become 2^5000?

Is my assumption correct, or is there another way to choose the exponent?


1 Answer


From the HashingTF documentation: "it is advisable to use a power of two as the feature dimension". I think this means numFeatures should be of the form 2^n, not 2 raised to the number of terms.

For example, if your vocabulary size is 900, then numFeatures should be a power of two greater than 900, so 2^10 = 1024 would be a good choice.
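For illustration, here is a minimal sketch using the DataFrame-based Pipelines API (org.apache.spark.ml.feature.HashingTF); the sample documents and column names are made up, and numFeatures is set to 1024 as in the example above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object HashingTFExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HashingTFExample")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample documents; in practice this would be your 1000+ documents.
    val docs = spark.createDataFrame(Seq(
      (0, "spark mllib converts tokenized words into feature vectors"),
      (1, "the feature dimension should be a power of two")
    )).toDF("id", "text")

    // Split raw text into tokens.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tokenized = tokenizer.transform(docs)

    // numFeatures is the fixed vector length: a power of two at least as large
    // as the vocabulary, e.g. 2^10 = 1024 for ~900 terms (not 2^vocabularySize).
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1024)

    hashingTF.transform(tokenized).select("features").show(truncate = false)

    spark.stop()
  }
}
```

Every document is hashed into a vector of length 1024 regardless of how many documents or terms you have; a larger power of two only reduces the chance of hash collisions.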
