I am new to Spark, of course. My data set contains an immense number of columns with categorical variables. I would like to store these categoricals in feature vectors and use, say, VectorIndexer to map categorical values to ordinals and back in a convenient way.
So, I want to achieve something as simple as this (pyspark notation):
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [
        (0, Vectors.dense([0.1, 0.2])),
        (1, Vectors.dense([0.1, 0.2])),
        (2, Vectors.dense([0.2, 1.2])),
        (3, Vectors.dense([0.1, 0.2])),
        (4, Vectors.dense([0.1, 2.2])),
        (5, Vectors.dense([0.1, 0.2]))],
    ["id", "features"]
)
But for string features:
# will not work; for demonstration purposes only
df = spark.createDataFrame(
[
(0, Vectors.dense(['a', 'x'])),
(1, Vectors.dense(['b', 'x'])),
(2, Vectors.dense(['c', 'y'])),
(3, Vectors.dense(['a', 'z'])),
(4, Vectors.dense(['z', 'x'])),
(5, Vectors.dense(['c', 'z']))],
["id", "features"]
)
I guess the Vectors class is not supposed to work with strings, but I would love to hear your suggestions about the nicest way to get this working.