
I'm struggling to create dummy columns in a PySpark DataFrame.

If I have a DataFrame with 10 columns (1 ID column, 9 object/string columns with n categories each), in pandas I can simply do:

import pandas as pd

cols = list(df.columns)
cols.remove('ID')

df = pd.get_dummies(df[cols])

However, I cannot find a single resource showing how to produce the same result in PySpark.

  • It's not going to be so easy in spark. The `get_dummies` function works because pandas can enumerate all of the possible values in memory and then create new columns (also in-memory). Hard to do this in a distributed way, unless [you know/collect the possible categories ahead of time](https://stackoverflow.com/questions/42805663/e-num-get-dummies-in-pyspark) (a sketch of this approach follows the comments). – pault Sep 15 '20 at 19:44
  • [You can also use `OneHotEncoder`](https://stackoverflow.com/questions/32277576/how-to-handle-categorical-features-with-spark-ml) but it's going to create a vector column (also sketched below). If your ultimate goal is to train an ml model in spark, this is probably the way to go. – pault Sep 15 '20 at 19:47
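
Following the first comment, here is a minimal sketch of the collect-the-categories approach. It assumes `df` is the Spark DataFrame from the question with an 'ID' column, and that each string column has few enough distinct values to collect to the driver:

from pyspark.sql import functions as F

cols = [c for c in df.columns if c != 'ID']

for c in cols:
    # distinct() runs on the cluster; collect() only brings the category
    # values back to the driver, not the full rows
    categories = [row[0] for row in df.select(c).distinct().collect()]
    for cat in categories:
        # one 0/1 column per category, named like get_dummies output
        # (a null category would produce a "<col>_None" column)
        df = df.withColumn(
            '{}_{}'.format(c, cat),
            F.when(F.col(c) == cat, 1).otherwise(0)
        )

df = df.drop(*cols)  # keep ID plus the new dummy columns

This mirrors `get_dummies`' column-per-category output, at the cost of one driver-side collect per string column.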

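And a sketch of the `OneHotEncoder` route from the second comment, assuming Spark 3.x (where `OneHotEncoder` accepts multiple input columns; on Spark 2.x the equivalent class is `OneHotEncoderEstimator`). Unlike `get_dummies`, the output is one vector column per input column rather than separate 0/1 columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

cols = [c for c in df.columns if c != 'ID']

# OneHotEncoder works on numeric category indices, so each string
# column is first mapped to an index column with StringIndexer
indexers = [
    StringIndexer(inputCol=c, outputCol=c + '_idx') for c in cols
]
# note: dropLast=True by default, so the last category is dropped,
# which differs slightly from get_dummies' output
encoder = OneHotEncoder(
    inputCols=[c + '_idx' for c in cols],
    outputCols=[c + '_vec' for c in cols],
)

pipeline = Pipeline(stages=indexers + [encoder])
encoded = pipeline.fit(df).transform(df)

The vector columns plug directly into a `VectorAssembler` and a Spark ML estimator, which is why this is the preferred route if the goal is model training in Spark.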
0 Answers