I am preparing data in a Spark DataFrame that contains categorical columns. I need to one-hot encode the categorical data, and I tried the following on Spark 1.6:
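from pyspark.sql import SQLContext
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# `sc` is the existing SparkContext (e.g. the one provided by the pyspark shell)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

# Map each category string to a numeric index (the most frequent label gets index 0)
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

# Encode each index as a one-hot vector; dropLast=False keeps a slot for every category
encoder = OneHotEncoder(dropLast=False, inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show()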
This code produces one-hot encoded data in the following format:
+---+-------------+
| id|  categoryVec|
+---+-------------+
|  0|(3,[0],[1.0])|
|  1|(3,[2],[1.0])|
|  2|(3,[1],[1.0])|
|  3|(3,[0],[1.0])|
|  4|(3,[0],[1.0])|
|  5|(3,[1],[1.0])|
+---+-------------+
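As far as I understand, each `categoryVec` value is a sparse vector printed as (size, [indices], [values]), so `(3,[0],[1.0])` would stand for `[1.0, 0.0, 0.0]`. Here is a minimal plain-Python sketch of that reading, just to show the layout I am after. The helper name `sparse_to_dense` is my own and is not a Spark API, and the rows are copied by hand from the output above.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Rows copied by hand from the show() output above: id -> (size, indices, values)
sparse_rows = {
    0: (3, [0], [1.0]),
    1: (3, [2], [1.0]),
    2: (3, [1], [1.0]),
}

# One dense 0/1 list per row, i.e. one position per category
dense_rows = {row_id: sparse_to_dense(*vec) for row_id, vec in sparse_rows.items()}

for row_id, dense in sorted(dense_rows.items()):
    print(row_id, dense)
```

This only illustrates how I interpret the notation; it does not run on the DataFrame itself.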
Usually, what I expect from one-hot encoding is one column per category, each holding a 0 or 1. How can I get that kind of output from this result?