
I am trying to extract features from my raw data.

My raw data is a `Seq[String]`.

I want to turn this into a one-hot-style encoding that can contain several 1s instead of only one (a multi-hot vector), but the Spark ML `OneHotEncoderEstimator` (https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator) seems to accept only a single value per row as input.

Maybe I am blind, but I can't seem to find an encoder that accepts a list of strings.

Thank you.

Wonay
  • Use `CountVectorizer` or `HashingTF` in their binary variant? – Alper t. Turker Jul 18 '18 at 17:34
  • So I used `HashingTF`, but how can you go back from the encoding to the token? I am going to take a look at `CountVectorizer`. – Wonay Jul 18 '18 at 17:51
  • If you need details go with `CountVectorizer` - [How to get word details from TF Vector RDD in Spark ML Lib?](https://stackoverflow.com/q/32285699/8371915) – Alper t. Turker Jul 18 '18 at 18:03

1 Answer


Thank you @user8371915

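After reading https://spark.apache.org/docs/2.2.0/ml-features.html#countvectorizer, `CountVectorizer` seems to be exactly what I need: with its binary option enabled, it maps a list of tokens to a vector holding a 1 for every token that is present.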

More info: [How to get word details from TF Vector RDD in Spark ML Lib?](https://stackoverflow.com/q/32285699/8371915)
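
For anyone landing here, this is a minimal sketch of how that could look in Scala. The column names (`id`, `tokens`, `features`), the sample rows, the object name `MultiHotExample` and the helper `presentTokens` are all made up for illustration; only `CountVectorizer`, `setBinary` and `vocabulary` come from Spark ML. It assumes Spark 2.x or later with a local session.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

object MultiHotExample {

  // Hypothetical helper: map the non-zero positions of an encoded vector
  // back to tokens, using the vocabulary learned by the fitted model.
  def presentTokens(v: Vector, vocab: Array[String]): Seq[String] =
    v.toSparse.indices.map(vocab(_)).toSeq

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("multi-hot-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up data: each row carries a Seq[String] of tokens.
    val df = Seq(
      (0, Seq("a", "b", "c")),
      (1, Seq("a", "b", "b", "c", "a"))
    ).toDF("id", "tokens")

    // setBinary(true) sets every non-zero count to 1.0, so repeated
    // tokens still produce a single 1 in the output vector.
    val model: CountVectorizerModel = new CountVectorizer()
      .setInputCol("tokens")
      .setOutputCol("features")
      .setBinary(true)
      .fit(df)

    val encoded = model.transform(df)
    encoded.show(false)

    // model.vocabulary(i) is the token stored at vector index i.
    val vocab: Array[String] = model.vocabulary
    encoded.select("id", "features").collect().foreach { row =>
      val tokens = presentTokens(row.getAs[Vector]("features"), vocab)
      println(s"${row.getAs[Int]("id")} -> ${tokens.mkString(", ")}")
    }

    spark.stop()
  }
}
```

Unlike `HashingTF`, which hashes tokens into buckets and keeps no reverse mapping, the fitted `CountVectorizerModel` stores its vocabulary, which is what makes the index-to-token lookup possible.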

Wonay