
My question is similar to this one but for Spark and the original question does not have a satisfactory answer.

I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's text (that has been pre-processed) as hash-tfidf and also its month as follows:

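import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, VectorAssembler}

// Hash the pre-processed tweet text into a 30,000-dimensional term-frequency vector
val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
  .setNumFeatures(30000)
// Rescale the term frequencies by inverse document frequency
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
  .setMinDocFreq(10)
// Map each month string to a numeric index
val monthIndexer = new StringIndexer().setInputCol("month")
  .setOutputCol("month-idx")
// Concatenate the month index and the TF-IDF vector into one feature vector
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")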

If there are 30,000 word features, won't these swamp the month? Or is VectorAssembler smart enough to handle this? (And, if possible, how do I get the best features of this model?)


1 Answer


VectorAssembler simply concatenates its input columns into a single vector, in the order given; it does not weight or rescale anything.

Since the 30,000-dimensional word vector is very sparse, the denser feature (the month) is likely to have a greater impact on the result, so it would probably not get "swamped" as you put it. You can train a model and check the weights of the features to confirm this. The LinearSVCModel exposes them through its coefficients member, which holds one weight per entry of the feature vector and shows how much each feature contributes to the final sum:

import org.apache.spark.ml.classification.LinearSVC

val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients  // one weight per entry of "features"

Features whose coefficients are larger in absolute value have more influence on the final result; the sign only indicates which class the feature pushes towards.
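To rank the features, one option is to sort the coefficient indices by absolute value. The following is a minimal sketch in plain Scala, not part of the Spark API: topFeatures is a hypothetical helper, and the toy array stands in for model.coefficients.toArray. With the VectorAssembler from the question, index 0 would be "month-idx" and indices 1 to 30000 the hashed terms.

```scala
// Hypothetical helper: return the n (index, coefficient) pairs with the
// largest absolute coefficient, most influential first.
def topFeatures(coeffs: Array[Double], n: Int): Array[(Int, Double)] =
  coeffs.zipWithIndex
    .map { case (w, i) => (i, w) }
    .sortBy { case (_, w) => -math.abs(w) }
    .take(n)

// Toy coefficients for illustration only (not real model output).
// In practice: val coeffs = model.coefficients.toArray
val toy = Array(0.2, -1.5, 0.0, 0.7)
val ranked = topFeatures(toy, 2)
ranked.foreach { case (i, w) => println(s"feature $i -> $w") }
```

Keep in mind that HashingTF is a one-way hash, so an index in the TF-IDF part of the vector cannot be mapped back to a single word without recomputing the hash for each candidate term.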

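If the weight given to the month turns out too low or too high, note that setWeightCol() will not change it: that method assigns a weight to each training row, not to each feature. LinearSVC also standardizes the features before training by default (see setStandardization()), which already puts the month index and the TF-IDF values on a comparable scale.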

  • Thanks! Could you explain briefly how to 'check the weights of the features to confirm this. Simply use the provided coefficients method of the LinearSVCModel.' – schoon Jan 08 '18 at 07:04
  • @schoon: Added some more explanation – Shaido Jan 08 '18 at 07:13