PySpark- OneHotEncoding

Question

This might be naive, but I have just started with PySpark and Spark. Please help me understand the one-hot encoding technique in PySpark. I am trying to apply OneHotEncoding to one of the columns. After one-hot encoding, the DataFrame schema gains a vector column. But to apply a machine learning algorithm, I expected individual columns added to the existing DataFrame, with each column representing a category, rather than a single vector-type column. How can I validate the OneHotEncoding?

My Code:

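    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    # Map each business_type string to a numeric index.
    stringIndexer = StringIndexer(inputCol="business_type", outputCol="business_type_Index")
    model = stringIndexer.fit(df)
    indexed = model.transform(df)
    # Spark 2.x: OneHotEncoder is a plain transformer.
    # In Spark 3.x it is an estimator: use encoder.fit(indexed).transform(indexed).
    encoder = OneHotEncoder(dropLast=False, inputCol="business_type_Index", outputCol="business_type_Vec")
    encoded = encoder.transform(indexed)
    encoded.select("business_type_Vec").show()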

This displays:

+-----------------+
|business_type_Vec|
+-----------------+
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
|    (2,[0],[1.0])|
+-----------------+
only showing top 20 rows

The newly added column is of vector type. How can I convert it to individual columns, one per category?

This is expected behaviour, you don't need to convert to individual columns, as spark ML works with feature vectors. — mtoto, Oct 27 '16 at 16:16

score 0 · Answer 1 · edited May 23 '17 at 10:34

You probably already have an answer, but maybe this will be helpful for someone else. To split the vector into columns, you can use this answer (I've checked that it works):

How to split dense Vector into columns - using pyspark

However, I don't think you need to convert the vector back to columns (as mtoto already said), since the models in Spark ML require the input features in vector format (please correct me if I am wrong).
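As for validating the encoding, it may help to know how to read the output in the question: Spark prints a sparse vector as `(size,[indices],[values])`, so `(2,[0],[1.0])` is a length-2 vector with 1.0 at position 0. Below is a small plain-Python sketch (no Spark needed) that expands that notation and checks the one-hot property; the helper names `sparse_to_dense` and `is_one_hot` are made up for illustration:

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's (size, [indices], [values]) notation into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def is_one_hot(dense):
    """True if exactly one entry is 1.0 and every other entry is 0.0."""
    return dense.count(1.0) == 1 and dense.count(0.0) == len(dense) - 1


row = sparse_to_dense(2, [0], [1.0])
print(row, is_one_hot(row))
```

The 20 rows shown in the question are all `(2,[0],[1.0])`, which means they all fall in the category with index 0; rows in the other category would print as `(2,[1],[1.0])`.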
