I have a dataframe like this:
from pyspark.sql.types import StructType

data = [
(1,'a',"BS", 20, "M"),
(2,'b',"MS", 20, "F"),
(3,'c',"PHD", 21, "F"),
(4,'d',"BS", 22, "M"),
]
schema = StructType().add("id","integer").add("name","string").add("degree","string").add("age", "integer").add("gender", "string")
df = spark.createDataFrame(data, schema=schema)
I am trying to convert the gender and degree columns to categorical variables and pass them to a linear regression model in PySpark.
I am currently using StringIndexer on each column individually, as shown below. Is there a way to apply it to a list of columns at once?
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
# Fit the indexer on a single column, then append the indexed column to df
varIdxer = StringIndexer(inputCol='degree', outputCol='varIdx').fit(df)
df = varIdxer.transform(df)
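To make the goal concrete, here is a plain-Python sketch (no Spark involved) of the result I am after when indexing a list of columns in one pass. It assumes the indexer orders labels by descending frequency and breaks ties alphabetically, which is my understanding of StringIndexer's default in recent Spark versions. The helper names `index_labels` and `index_columns` and the `_idx` suffix are made up for illustration.

```python
from collections import Counter

# Same rows as the DataFrame above, as plain tuples
rows = [
    (1, "a", "BS", 20, "M"),
    (2, "b", "MS", 20, "F"),
    (3, "c", "PHD", 21, "F"),
    (4, "d", "BS", 22, "M"),
]
columns = ["id", "name", "degree", "age", "gender"]


def index_labels(values):
    """Map each distinct label to a float index.

    Assumption: most frequent label first, ties broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def index_columns(rows, columns, to_index):
    """Append one '<col>_idx' value to every row for each column in to_index."""
    # Build (position, label -> index) for every requested column
    mappings = {}
    for col in to_index:
        pos = columns.index(col)
        mappings[col] = (pos, index_labels(r[pos] for r in rows))
    indexed = [
        r + tuple(mappings[col][1][r[mappings[col][0]]] for col in to_index)
        for r in rows
    ]
    return indexed, columns + [f"{col}_idx" for col in to_index]


indexed_rows, indexed_columns = index_columns(rows, columns, ["degree", "gender"])
for row in indexed_rows:
    print(row)
```

In other words, I would like to pass something like `["degree", "gender"]` once and get one indexed column per input column, instead of repeating the fit/transform pair for each column by hand.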