
I have a dataframe in spark and I want to manually map the values of one of the columns:

 Col1
  Y
  N
  N
  Y
  N
  Y

I want "Y" to be equal to 1 and "N" to be equal to 0, like this:

 Col1
  1
  0
  0
  1
  0
  1

I have tried StringIndexer, but I think it encodes the categorical data in an arbitrary order. (I am not sure.)

The python equivalent for this is:

df["Col1"] = df["Col1"].map({"Y": 1, "N": 0})

Can you please help me achieve this in PySpark?

Rishab Gupta
  • StringIndexer will encode based on the descending frequency of the level. Anyways what you want is df.withColumn("Col1", when(df['Col1'] == 'Y', 1).otherwise(0)) – sramalingam24 Mar 31 '19 at 04:46
  • Or you can simply do `df.withColumn("Col1", (df["Col1"]=="Y").cast("int"))` – pault Mar 31 '19 at 05:16

1 Answer


Since you want to map the values to 1 and 0, an easy way is to specify a boolean condition and cast the result to int:

from pyspark.sql.functions import col
df.withColumn("Col1", (col("Col1")=="Y").cast("int"))

For a more general case, you can use pyspark.sql.functions.when to implement if-then-else logic:

from pyspark.sql.functions import col, when
df.withColumn("Col1", when(col("Col1").isin(["Y"]), 1).otherwise(0))
pault