Apply StringIndexer to change columns in a PySpark Dataframe

Question

I am new to PySpark. I want to apply StringIndexer to replace the values of a column with their indices. I checked this post: Apply StringIndexer to several columns in a PySpark Dataframe

That solution creates a new column rather than updating the input column. Is there a way to update the current column? I tried using the same name for the input and output columns, but it does not work.

label_stringIdx = StringIndexer(inputCol="WindGustDir", outputCol="WindGustDir_index")

score 1 · Accepted Answer · answered Oct 14 '19 at 05:46

You cannot simply update that column in place. What you can do instead is:

1. create a new column using the StringIndexer
2. delete the original column
3. rename the new column to the name of the original column

You can use this code:

from pyspark.ml.feature import StringIndexer
import pyspark.sql.functions as F


df = spark.createDataFrame([['a', 1], ['b', 1], ['c', 2], ['b', 5]], ['WindGustDir', 'value'])
df.show()
# +-----------+-----+
# |WindGustDir|value|
# +-----------+-----+
# |          a|    1|
# |          b|    1|
# |          c|    2|
# |          b|    5|
# +-----------+-----+

# 1. create new column
label_stringIdx = StringIndexer(inputCol="WindGustDir", outputCol="WindGustDir_index")
label_stringIdx_model = label_stringIdx.fit(df)
df = label_stringIdx_model.transform(df)

# 2. delete original column
df = df.drop("WindGustDir")

# 3. rename new column
to_rename = ['WindGustDir_index', 'value']
replace_with = ['WindGustDir', 'value']
mapping = dict(zip(to_rename, replace_with))
df = df.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])

df.show()

# +-----------+-----+
# |WindGustDir|value|
# +-----------+-----+
# |        1.0|    1|
# |        0.0|    1|
# |        2.0|    2|
# |        0.0|    5|
# +-----------+-----+
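
The indices in the output above follow from StringIndexer's default ordering (`stringOrderType="frequencyDesc"`): the most frequent label gets index 0.0. The plain-Python sketch below only mimics that assignment for illustration; it is not Spark's implementation. The function name `frequency_index` is hypothetical, and the alphabetical tie-break is an assumption that may depend on the Spark version.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    # Sort by descending count, then by label for equal counts.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


column = ["a", "b", "c", "b"]
mapping = frequency_index(column)
print(mapping)                       # {'b': 0.0, 'a': 1.0, 'c': 2.0}
print([mapping[v] for v in column])  # [1.0, 0.0, 2.0, 0.0]
```

With the sample data, `b` occurs twice and therefore maps to 0.0, which matches the `WindGustDir` values shown in the final `df.show()`.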
