I have a dataframe like this:

from pyspark.sql.types import StructType

data = [
    (1,'a',"BS", 20, "M"),
    (2,'b',"MS", 20, "F"),
    (3,'c',"PHD", 21, "F"),
    (4,'d',"BS", 22, "M"),
]
schema = StructType().add("id","integer").add("name","string").add("degree","string").add("age", "integer").add("gender", "string")
df = spark.createDataFrame(data, schema=schema)

I am trying to convert the gender and degree columns to categorical variables and pass them to a linear regression model in PySpark.

I am using StringIndexer as shown below, one column at a time. Is there a way to apply it to a list of columns at once?

from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
varIdxer = StringIndexer(inputCol='degree',outputCol='varIdx').fit(df)
df = varIdxer.transform(df)
armin
  • Does this answer your question? [Apply StringIndexer to several columns in a PySpark Dataframe](https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe) – werner Sep 23 '22 at 18:08
