
Say I have categorical features in a dataframe. In order to do ML on the dataframe, I do one-hot encoding on the categorical columns using OneHotEncoderEstimator() and then use VectorAssembler() to assemble all the features into a single column. Reading the Spark docs, I've seen VectorIndexer() used to index categorical features in a features vector column. If I have already performed one-hot encoding on the categorical columns before assembling the features vector column, is there any point in applying VectorIndexer() to it?

  • Whoever is downvoting the question, can you please specify the reason for the downvote? – rasthiya Jan 16 '19 at 18:22
  • Your question is way too broad (please see [How to ask](https://stackoverflow.com/help/how-to-ask)); this sounds exactly like a question where you should have done the experiment you describe verbally, and if still having *coding* issues/questions, post a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). – desertnaut Jan 16 '19 at 18:27
  • Thank you. I have done experiments and it doesn't seem to make a difference with or without VectorIndexer (at least in my case). I just wanted to know the opinion of someone who has a better understanding of the specifics. I know that there is a downside to the indexed variables compared to OHE because of ranking. – rasthiya Jan 16 '19 at 18:35
  • You are very welcome to share your experiments here, instead of expecting us to re-create them from scratch based only on what we may have understood (or not) from your verbal description. Notice also that SO is not meant for *opinion*-based questions... – desertnaut Jan 16 '19 at 18:38

1 Answer


OneHotEncoder(Estimator) and VectorIndexer are quite different beasts and are not interchangeable. OneHotEncoder(Estimator) is used primarily when the downstream process uses a linear model (it can also be used with Naive Bayes).

Let's consider a simple Dataset

import spark.implicits._  // `spark` is the active SparkSession (predefined in spark-shell)

val df = Seq(1.0, 2.0, 3.0).toDF  // a single Double column named "value"

and a Pipeline

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val m1 = new Pipeline().setStages(Array(
  new OneHotEncoderEstimator()  // renamed to OneHotEncoder in Spark 3.0
    .setInputCols(Array("value"))
    .setOutputCols(Array("features"))
)).fit(df)

If such a model is applied to our data, it will be one-hot-encoded (depending on its configuration, OneHotEncoderEstimator supports both one-hot encoding and dummy encoding); in other words, each level, excluding the reference, is represented as a separate binary column:

m1.transform(df).schema("features").metadata
 org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"binary":[{"idx":0,"name":"0"},{"idx":1,"name":"1"},{"idx":2,"name":"2"}]},"num_attrs":3}}
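
For illustration, this is what the transformed data looks like; the output below is a sketch (exact rendering may differ between Spark versions), where each value is mapped to a sparse binary vector:

m1.transform(df).show
+-----+-------------+
|value|     features|
+-----+-------------+
|  1.0|(3,[1],[1.0])|
|  2.0|(3,[2],[1.0])|
|  3.0|    (3,[],[])|
+-----+-------------+

Note how 3.0, the dropped reference level, maps to an empty vector.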

Please note that such a representation is inefficient and impractical to use with algorithms that handle categorical features natively.

In contrast, VectorIndexer does not expand the vector; it analyzes the data, maps each categorical value to a 0-based index, and adjusts the metadata accordingly:

val m2 = new Pipeline().setStages(Array(
  new VectorAssembler().setInputCols(Array("value")).setOutputCol("raw"),
  new VectorIndexer()  // a feature is treated as categorical if it has at most maxCategories (default 20) distinct values
    .setInputCol("raw").setOutputCol("features")
)).fit(df)

m2.transform(df).schema("features").metadata
org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"nominal":[{"ord":false,"vals":["1.0","2.0","3.0"],"idx":0,"name":"value"}]},"num_attrs":1}}
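
Applying the model makes the difference concrete: the vector keeps a single slot per feature, with values replaced by their category indices (again, a sketch of the expected output):

m2.transform(df).show
+-----+-----+--------+
|value|  raw|features|
+-----+-----+--------+
|  1.0|[1.0]|   [0.0]|
|  2.0|[2.0]|   [1.0]|
|  3.0|[3.0]|   [2.0]|
+-----+-----+--------+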

In other words, it is more or less equivalent to a vectorized variant of StringIndexer (you can achieve a similar result, with more control over the output, using a set of StringIndexers followed by a VectorAssembler); a minimal sketch of that alternative is shown below (the intermediate column name value_idx is just illustrative):
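
val m3 = new Pipeline().setStages(Array(
  new StringIndexer()  // per-column control, e.g. setHandleInvalid or setStringOrderType
    .setInputCol("value").setOutputCol("value_idx"),
  new VectorAssembler()
    .setInputCols(Array("value_idx")).setOutputCol("features")
)).fit(df)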

Such features are unsuitable for linear models, but are valid input for decision trees and tree ensembles.
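
For example, a decision tree picks the category information up from the metadata without any further preprocessing; a quick sketch (the label column here is fabricated purely for illustration):

import org.apache.spark.ml.classification.DecisionTreeClassifier

val labeled = m2.transform(df).withColumn("label", $"value" - 1.0)  // made-up 0-based label

val tree = new DecisionTreeClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(labeled)  // the nominal attribute in the metadata marks "value" as categorical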

To summarize: in practice, OneHotEncoder(Estimator) and VectorIndexer are mutually exclusive, and the choice of which one to use depends on the downstream process.
