This may be a naive question, but I have just started with PySpark and Spark. Please help me understand one-hot encoding in PySpark. I am trying to apply OneHotEncoder to one of my columns. After encoding, the DataFrame schema gains a vector column. To apply a machine learning algorithm, though, I expected individual columns added to the existing DataFrame, one per category, rather than a single vector-type column. How can I validate the OneHotEncoder output?
My code:
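from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Map each business_type string to a numeric index
stringIndexer = StringIndexer(inputCol="business_type", outputCol="business_type_Index")
model = stringIndexer.fit(df)
indexed = model.transform(df)
# One-hot encode the index into a vector column; dropLast=False keeps a slot for every category
encoder = OneHotEncoder(dropLast=False, inputCol="business_type_Index", outputCol="business_type_Vec")
# Note: on Spark 3.0+, OneHotEncoder is an Estimator, so this would be encoder.fit(indexed).transform(indexed)
encoded = encoder.transform(indexed)
encoded.select("business_type_Vec").show()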
This displays:
+-----------------+
|business_type_Vec|
+-----------------+
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
| (2,[0],[1.0])|
+-----------------+
only showing top 20 rows
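For reference when reading this output: each entry is a sparse vector printed as (size, [indices], [values]). So (2,[0],[1.0]) is a length-2 vector with 1.0 at position 0 and 0.0 elsewhere, which suggests that all 20 rows shown fall into the category with index 0. A minimal plain-Python sketch of that expansion (the helper name sparse_to_dense is only illustrative and is not part of PySpark):

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size              # every slot starts at 0.0
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the stored positions
    return dense

# The entry shown in the output above: (2,[0],[1.0])
row = sparse_to_dense(2, [0], [1.0])
print(row)
```

With dropLast=False there is one slot per category, so the position of the 1.0 should match the value in business_type_Index for that row.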
The newly added column is of vector type. How can I convert it into individual columns, one per category?