For example, I have a DataFrame with categorical features in the name column:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .appName("example")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

features = [(['a', 'b', 'c'], 1),
            (['a', 'c'], 2),
            (['d'], 3),
            (['b', 'c'], 4),
            (['a', 'b', 'd'], 5)]

df = spark.createDataFrame(features, ['name', 'id'])
df.show()
Out:
+---------+---+
|     name| id|
+---------+---+
|[a, b, c]|  1|
|   [a, c]|  2|
|      [d]|  3|
|   [b, c]|  4|
|[a, b, d]|  5|
+---------+---+
What I want to get:
+------+------+------+------+---+
|name_a|name_b|name_c|name_d| id|
+------+------+------+------+---+
|     1|     1|     1|     0|  1|
|     1|     0|     1|     0|  2|
|     0|     0|     0|     1|  3|
|     0|     1|     1|     0|  4|
|     1|     1|     0|     1|  5|
+------+------+------+------+---+
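The closest I have come myself is building the dummy columns manually with array_contains. This is just a sketch, not a proper ML-pipeline solution, and it assumes the set of distinct values is small enough to collect to the driver:
from pyspark.sql.functions import array_contains, explode

# collect the distinct values that appear anywhere in `name`
# (assumes this set is small enough to bring to the driver)
values = sorted(r[0] for r in df.select(explode('name')).distinct().collect())

# one 0/1 column per distinct value, via array_contains
df.select(
    *[array_contains('name', v).cast('int').alias('name_' + v) for v in values],
    'id'
).show()
This produces the table above, but it feels like a workaround rather than the intended way to do it with pyspark.ml.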
I found the same question, but there is nothing helpful there.
I tried to use VectorIndexer from pyspark.ml, but I ran into problems transforming the name field to vector type:
from pyspark.ml.feature import VectorIndexer
indexer = VectorIndexer(inputCol="name", outputCol="indexed", maxCategories=5)
indexerModel = indexer.fit(df)
I get the following error:
Column name must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType
I found a solution here, but it looks overcomplicated. From the error it seems VectorIndexer expects a numeric ml.linalg vector column rather than an array of strings, so I'm not sure whether this can be done with VectorIndexer alone.
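For what it's worth, CountVectorizer from pyspark.ml.feature does accept an array<string> column directly, but it produces a single vector column rather than the separate name_* columns I want. A sketch, in case that direction is relevant:
from pyspark.ml.feature import CountVectorizer

# CountVectorizer accepts array<string> input directly;
# binary=True gives 0/1 presence flags instead of counts
cv = CountVectorizer(inputCol='name', outputCol='name_vec', binary=True)
model = cv.fit(df)
model.transform(df).show(truncate=False)

# model.vocabulary maps vector positions back to the original values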