
I'm struggling to create dummy columns in a PySpark DataFrame.

If I have a DataFrame with 10 columns (1 ID column, 9 object/string columns with n categories each), in pandas I can simply do:

import pandas as pd

cols = list(df.columns)
cols.remove('ID')

df = pd.get_dummies(df[cols])

However, I cannot find a single resource showing how to produce the same result in PySpark.

  • It's not going to be so easy in spark. The `get_dummies` function works because pandas can enumerate all of the possible values in memory and then create new columns (also in-memory). Hard to do this in a distributed way, unless [you know/collect the possible categories ahead of time](https://stackoverflow.com/questions/42805663/e-num-get-dummies-in-pyspark) (a sketch of this approach follows the comments). – pault Sep 15 '20 at 19:44
  • [You can also use `OneHotEncoder`](https://stackoverflow.com/questions/32277576/how-to-handle-categorical-features-with-spark-ml) but it's going to create a vector column (also sketched below). If your ultimate goal is to train an ml model in spark, this is probably the way to go. – pault Sep 15 '20 at 19:47
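
Following the first comment, here is a minimal sketch of the collect-the-categories approach. It assumes `df` is the Spark DataFrame from the question with an 'ID' column, and that each string column has few enough distinct values to collect to the driver:

from pyspark.sql import functions as F

cols = [c for c in df.columns if c != 'ID']

for c in cols:
    # distinct() runs on the cluster; collect() only brings the category
    # values back to the driver, not the full rows
    categories = [row[0] for row in df.select(c).distinct().collect()]
    for cat in categories:
        # one 0/1 column per category, named like get_dummies output
        # (a null category would produce a "<col>_None" column)
        df = df.withColumn(
            '{}_{}'.format(c, cat),
            F.when(F.col(c) == cat, 1).otherwise(0)
        )

df = df.drop(*cols)  # keep ID plus the new dummy columns

This mirrors `get_dummies`' column-per-category output, at the cost of one driver-side collect per string column.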

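And a sketch of the `OneHotEncoder` route from the second comment, assuming Spark 3.x (where `OneHotEncoder` accepts multiple input columns; on Spark 2.x the equivalent class is `OneHotEncoderEstimator`). Unlike `get_dummies`, the output is one vector column per input column rather than separate 0/1 columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

cols = [c for c in df.columns if c != 'ID']

# OneHotEncoder works on numeric category indices, so each string
# column is first mapped to an index column with StringIndexer
indexers = [
    StringIndexer(inputCol=c, outputCol=c + '_idx') for c in cols
]
# note: dropLast=True by default, so the last category is dropped,
# which differs slightly from get_dummies' output
encoder = OneHotEncoder(
    inputCols=[c + '_idx' for c in cols],
    outputCols=[c + '_vec' for c in cols],
)

pipeline = Pipeline(stages=indexers + [encoder])
encoded = pipeline.fit(df).transform(df)

The vector columns plug directly into a `VectorAssembler` and a Spark ML estimator, which is why this is the preferred route if the goal is model training in Spark.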
0 Answers