Convert all nominal variables to categorical variables in pyspark

Question

I'm looking for a way to convert all the string-type columns in my PySpark dataframe to categorical variables so I can run a decision tree on the dataframe. Because of resource constraints I can't use pandas and can only use PySpark libraries. I've identified VectorIndexer as a possible solution, but I don't understand how to convert all the string-type columns, which the documentation says is possible.

Could somebody help me with the syntax on how to do that? I'm after something like this:

featureIndexer = VectorIndexer(inputCol=<list of input columns>, outputCol=<list of output columns>, maxCategories=10).fit(df)

or letting the VectorIndexer figure out which ones need vectoring on its own, which the documentation seems to indicate it can do:

featureIndexer = VectorIndexer(df, maxCategories=10).fit(df)

Thanks in advance.

score 3 · Accepted Answer · answered Oct 10 '17 at 06:51


VectorIndexer takes a column of vector type as input, but it sounds like you have columns of strings. In that case I would recommend using StringIndexer and OneHotEncoder.

The StringIndexer maps a string column of labels to a column of label indices (doubles). The OneHotEncoder then converts each index into a sparse binary vector that marks the category, stored in a single vector column, to use as a categorical feature.

Afterwards, all these features can be combined into a single vector with a VectorAssembler. I would recommend using a Pipeline to put all the stages together with the classifier.
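As a rough sketch of the stages described above for a single string column: the column names, the "_idx"/"_vec" suffixes, the "features" output name, and the numeric "label" column are all assumptions made for illustration, not anything from the question. OneHotEncoder is used here with inputCol/outputCol; depending on the Spark version it behaves as a transformer or an estimator, and either should work as a Pipeline stage.

```python
def stage_columns(col):
    """Hypothetical naming scheme for the intermediate columns of one feature."""
    return col + "_idx", col + "_vec"


def build_pipeline(string_col, numeric_cols, label_col="label"):
    """Build a Pipeline: index -> one-hot encode -> assemble -> decision tree.

    Assumes `label_col` is already numeric; a string label would need its
    own StringIndexer stage.
    """
    # Imported lazily so the naming helper above can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    idx_col, vec_col = stage_columns(string_col)

    # String labels -> double indices.
    indexer = StringIndexer(inputCol=string_col, outputCol=idx_col)
    # Indices -> sparse binary vectors.
    encoder = OneHotEncoder(inputCol=idx_col, outputCol=vec_col)
    # Encoded column plus the numeric columns -> one feature vector.
    assembler = VectorAssembler(
        inputCols=[vec_col] + list(numeric_cols), outputCol="features"
    )
    tree = DecisionTreeClassifier(labelCol=label_col, featuresCol="features")

    return Pipeline(stages=[indexer, encoder, assembler, tree])
```

With an active SparkSession and a dataframe df that has, say, a string column "city" and a numeric column "age", something like build_pipeline("city", ["age"]).fit(df) would train the whole chain in one call.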

Here is the documentation of the different available feature transformations as well as examples of how they work.

Shaido


Thank you for the quick reply, I kinda figured I messed up with that too. Could you also suggest how I can convert all columns? I tried passing the column names as a list to the inputCol argument, but that resulted in an error. I have some 50-odd string columns that need converting and don't want to do everything by hand. Thanks a lot! – words_of_wisdom Oct 10 '17 at 06:56
@words_of_wisdom You can take a look at [this](https://stackoverflow.com/a/36944716/7579547) answer that uses a pipeline to do the converting of multiple columns at once. – Shaido Oct 10 '17 at 07:04
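The general idea of indexing many columns with one pipeline might look roughly like the sketch below. It is not necessarily the linked answer's exact code: the helper names and the "_index" suffix are made up for illustration. It relies on df.dtypes, which lists (column name, type) pairs and reports string columns as "string".

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark SQL type is 'string'.

    `dtypes` is a list of (name, type) pairs, as returned by `df.dtypes`.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def index_all_strings(df, suffix="_index"):
    """Add an indexed copy of every string column in `df` (hypothetical helper)."""
    # Imported lazily so string_columns can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    # One StringIndexer stage per string column; the originals are kept.
    indexers = [
        StringIndexer(inputCol=name, outputCol=name + suffix)
        for name in string_columns(df.dtypes)
    ]
    return Pipeline(stages=indexers).fit(df).transform(df)
```

The indexed columns can then be fed to OneHotEncoder and VectorAssembler stages in the same way as for a single column, so none of the 50-odd columns has to be listed by hand.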
