
I have a dataframe in spark and I want to manually map the values of one of the columns:

 Col1
  Y
  N
  N
  Y
  N
  Y

I want "Y" to be equal to 1 and "N" to be equal to 0, like this:

 Col1
  1
  0
  0
  1
  0
  1

I have tried StringIndexer, but I think it encodes the categorical data in an arbitrary order. (I am not sure.)

The python equivalent for this is:

df["Col1"] = df["Col1"].map({"Y": 1, "N": 0})

Can you please help me achieve this in PySpark?

Rishab Gupta
  • StringIndexer will encode based on the descending frequency of the level. Anyways what you want is df.withColumn("Col1", when(df['Col1'] == 'Y', 1).otherwise(0)) – sramalingam24 Mar 31 '19 at 04:46
  • Or you can simply do `df.withColumn("Col1", (df["Col1"]=="Y").cast("int"))` – pault Mar 31 '19 at 05:16

1 Answer


Since you want to map the values to 1 and 0, an easy way is to specify a boolean condition and cast the result to int:

from pyspark.sql.functions import col
df.withColumn("Col1", (col("Col1")=="Y").cast("int"))

For a more general case, you can use pyspark.sql.functions.when to implement if-then-else logic:

from pyspark.sql.functions import col, when
df.withColumn("Col1", when(col("Col1").isin(["Y"]), 1).otherwise(0))
pault