I am hoping to dummy encode my categorical variables into numerical 0/1 indicator columns, using PySpark syntax.

I read in the data like this:

data = sqlContext.read.csv("data.txt", sep=";", header="true")

In Python I am able to encode my variables using the code below:

data = pd.get_dummies(data, columns=['Continent'])
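
For reference, since the original image is not reproduced here, this is roughly what that call produces on hypothetical sample data (dtype=int forces 0/1 output, as newer pandas versions default to booleans):

import pandas as pd

data = pd.DataFrame({'Continent': ['Asia', 'Europe', 'Asia']})
data = pd.get_dummies(data, columns=['Continent'], dtype=int)
print(data)
#    Continent_Asia  Continent_Europe
# 0               1                 0
# 1               0                 1
# 2               1                 0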

However, I am not sure how to do this in PySpark.

Any assistance would be greatly appreciated.

1 Answer

Try this:

import pyspark.sql.functions as F

# Collect the distinct values of the categorical column.
categ = df.select('Continent').distinct().rdd.flatMap(lambda x: x).collect()

# Build one 0/1 indicator column per distinct value.
exprs = [F.when(F.col('Continent') == cat, 1).otherwise(0).alias(str(cat))
         for cat in categ]

df = df.select(exprs + df.columns)

Exclude df.columns from the select if you do not want the original columns in your transformed DataFrame.
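
Here is a minimal end-to-end sketch of the snippet above, assuming a hypothetical two-column DataFrame; note that the order of the indicator columns follows whatever order distinct() returns, which is not guaranteed:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('Asia', 1), ('Europe', 2), ('Asia', 3)],
    ['Continent', 'id'])

categ = df.select('Continent').distinct().rdd.flatMap(lambda x: x).collect()
exprs = [F.when(F.col('Continent') == cat, 1).otherwise(0).alias(str(cat))
         for cat in categ]
df.select(exprs + df.columns).show()
# +----+------+---------+---+
# |Asia|Europe|Continent| id|
# +----+------+---------+---+
# |   1|     0|     Asia|  1|
# |   0|     1|   Europe|  2|
# |   1|     0|     Asia|  3|
# +----+------+---------+---+

If you are building features for pyspark.ml, StringIndexer followed by OneHotEncoder is the more common route, but it produces a single sparse vector column rather than the separate 0/1 columns shown here.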
