How to use string variables in VectorAssembler in Pyspark

Question

I want to run the Random Forests algorithm in PySpark. The PySpark documentation says that VectorAssembler accepts only numeric or boolean data types. So, if my data contains StringType variables, say names of cities, should I one-hot encode them before proceeding with Random Forest classification/regression?

Here is the code I have been trying:

from pyspark.ml.feature import VectorAssembler

train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('filename')
drop_list = ["Country", "Carrier", "TrafficType", "Device", "Browser", "OS", "Fraud", "ConversionPayOut"]
# Only this variable is actually a double; the rest are strings.
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))
junk = train.select([column for column in train.columns if column in drop_list])
# Assumed definition: the post does not show how `assembler` was created.
assembler = VectorAssembler(inputCols=junk.columns, outputCol="features")
transformed = assembler.transform(junk)

I keep getting the error IllegalArgumentException: u'Data type StringType is not supported.'

P.S.: Apologies for asking a basic question. I come from an R background. In R, Random Forests do not require converting categorical variables into numeric variables.

Related [question](https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe) that can also be useful. You just need to concat the `indexers` into your pipeline. — Ric S, Mar 30 '21 at 14:16

score 5 · Accepted Answer · answered Sep 21 '17 at 07:27

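Yes, you should use StringIndexer, possibly together with OneHotEncoder. StringIndexer maps each distinct string to a numeric index, and OneHotEncoder turns that index into a sparse binary vector. You can find more information on these two in the linked documentation.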

Mariusz

score 1 · Answer 2 · answered Jul 16 '18 at 11:36

The following example uses the Scala API; the same stages are available in PySpark.
Schema:
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: double (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: double (nullable = true)
 |-- capital-loss: double (nullable = true)
 |-- hours-per-week: double (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    // Deal with categorical columns:
    // index each string column (string -> numeric index).
    val workclassIndexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
    val educationIndexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
    val maritalStatusIndexer = new StringIndexer().setInputCol("marital-status").setOutputCol("maritalStatusIndex")
    val occupationIndexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
    val relationshipIndexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
    val raceIndexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
    val sexIndexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
    val nativeCountryIndexer = new StringIndexer().setInputCol("native-country").setOutputCol("nativeCountryIndex")
    // Index the target column straight into "label"; a second indexer on incomeIndex is redundant.
    val incomeIndexer = new StringIndexer().setInputCol("income").setOutputCol("label")

    // One-hot encode each indexed column (numeric index -> sparse vector).
    val workclassEncoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")
    val educationEncoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")
    val maritalStatusEncoder = new OneHotEncoder().setInputCol("maritalStatusIndex").setOutputCol("maritalVec")
    val occupationEncoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")
    val relationshipEncoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")
    val raceEncoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")
    val sexEncoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")
    val nativeCountryEncoder = new OneHotEncoder().setInputCol("nativeCountryIndex").setOutputCol("nativeCountryVec")

    // Assemble everything into ("label", "features") format.
    val assembler = new VectorAssembler()
      .setInputCols(Array("workclassVec", "fnlwgt", "educationVec", "education-num", "maritalVec",
        "occupationVec", "relationshipVec", "raceVec", "sexVec", "capital-gain", "capital-loss",
        "hours-per-week", "nativeCountryVec"))
      .setOutputCol("features")

    // Set up the pipeline: indexers first, then encoders, then the assembler and the model.
    val lr = new LogisticRegression()

    val pipeline = new Pipeline().setStages(Array(
      workclassIndexer, educationIndexer, maritalStatusIndexer, occupationIndexer,
      relationshipIndexer, raceIndexer, sexIndexer, nativeCountryIndexer, incomeIndexer,
      workclassEncoder, educationEncoder, maritalStatusEncoder, occupationEncoder,
      relationshipEncoder, raceEncoder, sexEncoder, nativeCountryEncoder,
      assembler, lr))

    // Fit the pipeline to the training data (`training` is a DataFrame with the schema above).
    val model = pipeline.fit(training)
