Pyspark Dataframe One-Hot Encoding

Question

I am preparing a Spark DataFrame that contains categorical data. I need to one-hot encode the categorical column, and I tried the following on Spark 1.6:

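from pyspark.sql import SQLContext
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# sc is the existing SparkContext (for example, the one the pyspark shell provides)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

# Map each category string to a numeric index
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)
# Encode each index as a sparse one-hot vector; dropLast=False keeps a slot for every category
encoder = OneHotEncoder(dropLast=False, inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show()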

This piece of code resulted in one-hot encoded data in this format.

+---+-------------+
| id|  categoryVec|
+---+-------------+
|  0|(3,[0],[1.0])|
|  1|(3,[2],[1.0])|
|  2|(3,[1],[1.0])|
|  3|(3,[0],[1.0])|
|  4|(3,[0],[1.0])|
|  5|(3,[1],[1.0])|
+---+-------------+
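
For readers unfamiliar with the notation: each categoryVec entry is a sparse vector printed as (size, [indices], [values]). StringIndexer assigns indices by descending label frequency by default, so here "a" is 0, "c" is 1 and "b" is 2. The following is a minimal pure-Python sketch of how such a triple expands into dense 0/1 values; sparse_to_dense is a hypothetical helper written for illustration, not a Spark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Triples copied from the rows with id 0, 1 and 2 in the table above
rows = [(0, (3, [0], [1.0])), (1, (3, [2], [1.0])), (2, (3, [1], [1.0]))]
for row_id, triple in rows:
    print(row_id, sparse_to_dense(*triple))
```

So the row with id 1 (category "b") has its single 1.0 in the third slot, which is the same information a three-column 0/1 layout would carry.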

What I usually expect from one-hot encoding is one column per category, with each column holding 0 or 1. How can I get data in that shape from this output?
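
One possible way to get that layout, sketched under stated assumptions, is to skip the vector entirely and build one CASE expression per category for DataFrame.selectExpr. This is a different technique from StringIndexer plus OneHotEncoder, not a conversion of its output. The helper name one_hot_exprs is hypothetical, and the sketch assumes the column name and the category values are plain word-like strings (letters, digits, underscore) and that the number of distinct categories is small enough to collect to the driver.

```python
import re


def one_hot_exprs(column, categories):
    """Build one SQL CASE expression per category value (hypothetical helper)."""
    exprs = []
    for value in categories:
        # Assumption: values are simple tokens, so no SQL quoting or escaping is needed
        if not re.match(r"^\w+$", value):
            raise ValueError("unsupported category value: %r" % (value,))
        exprs.append(
            "CASE WHEN {col} = '{val}' THEN 1 ELSE 0 END AS {col}_{val}".format(
                col=column, val=value
            )
        )
    return exprs


for expr in one_hot_exprs("category", ["a", "b", "c"]):
    print(expr)

# Assumed usage with the DataFrame from the question (not executed here):
#   categories = sorted(r[0] for r in df.select("category").distinct().collect())
#   df.selectExpr("id", *one_hot_exprs("category", categories)).show()
```

The result would have columns id, category_a, category_b and category_c. As the first comment below points out, this wide layout uses more memory than the sparse vector, so it is mainly useful for inspection or export rather than as input to Spark ML.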

Having multiple columns instead of a sparse vector is memory inefficient. Why would you want that? — David Arenburg, Jul 04 '17 at 11:25
When I feed this vector type to an ML algorithm, will it be accepted? — Jack Daniel, Jul 04 '17 at 11:29
Of course. This is the whole idea of OHE: to convert your data into a format that an algorithm can handle. See [this](https://stackoverflow.com/questions/32982425/encode-and-assemble-multiple-features-in-pyspark/32984795#32984795) — David Arenburg, Jul 04 '17 at 12:40
