
I am new to Spark, of course. My data set contains an immense number of columns with categorical variables. I would like to use feature vectors to store these categoricals and then use, say, VectorIndexer to map from categorical to ordinal values and back in a convenient way.

So, I want to achieve something as simple as this (pyspark notation):

from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [
      (0, Vectors.dense([0.1, 0.2])),
      (1, Vectors.dense([0.1, 0.2])),
      (2, Vectors.dense([0.2, 1.2])),
      (3, Vectors.dense([0.1, 0.2])),
      (4, Vectors.dense([0.1, 2.2])),
      (5, Vectors.dense([0.1, 0.2]))],
    ["id", "features"]
)

But for string features:

# will not work; for demonstration purposes only
df = spark.createDataFrame(
        [
           (0, Vectors.dense(['a', 'x'])), 
           (1, Vectors.dense(['b', 'x'])), 
           (2, Vectors.dense(['c', 'y'])), 
           (3, Vectors.dense(['a', 'z'])), 
           (4, Vectors.dense(['z', 'x'])), 
           (5, Vectors.dense(['c', 'z']))],
        ["id", "features"]
      )

I guess the Vector class is not supposed to work with strings, but I would love to hear your suggestions about the nicest way to get this working.

y.selivonchyk
  • You could try using a StringIndexer, it will map a string to an index that you could use as a feature. – Shaido May 24 '17 at 02:53
  • Possible duplicate of [Encode and assemble multiple features in PySpark](https://stackoverflow.com/questions/32982425/encode-and-assemble-multiple-features-in-pyspark) – zero323 May 24 '17 at 17:31
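For illustration, the string-to-ordinal mapping that the suggested StringIndexer performs (followed by assembling the indexed columns into a feature vector) can be sketched in plain Python. This is a conceptual sketch, not the Spark API; the helper names are made up, and StringIndexer's exact tie-breaking order for equally frequent labels may differ:

```python
from collections import Counter

def fit_index(values):
    """Map each label to a float index, most frequent label first,
    roughly mirroring StringIndexer's default frequency ordering."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# The string rows from the question.
rows = [('a', 'x'), ('b', 'x'), ('c', 'y'), ('a', 'z'), ('z', 'x'), ('c', 'z')]

# Fit one mapping per column, as one StringIndexer per column would.
columns = list(zip(*rows))
mappings = [fit_index(col) for col in columns]

# Assemble the indexed columns into dense vectors,
# analogous to running VectorAssembler on the indexed columns.
features = [[m[v] for m, v in zip(mappings, row)] for row in rows]
```

In Spark itself this would be one StringIndexer stage per string column plus a VectorAssembler, typically chained in a Pipeline.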
