I want to one-hot encode multiple categorical features using pyspark (version 2.3.2). My data is very large (hundreds of features, millions of rows). The problem is that pyspark's `OneHotEncoder` class returns its result as a single vector column. I need the result as a separate column per category. What is the most efficient way to achieve this?
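For context, here is roughly what my current encoding step looks like for a single feature (a minimal sketch; `categoricalFeature`, `c_idx`, and `c_idx_vec` are illustrative names matching the snippets below):

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# StringIndexer maps the string category to a numeric index,
# OneHotEncoder then turns that index into one sparse vector column.
indexer = StringIndexer(inputCol='categoricalFeature', outputCol='c_idx')
df = indexer.fit(df).transform(df)

encoder = OneHotEncoder(inputCol='c_idx', outputCol='c_idx_vec')
df = encoder.transform(df)
```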
Options 1 & 2
If you indeed use `OneHotEncoder`, the problem is how to transform that one vector column into separate binary columns. When searching around, I found that most other answers seem to use some variant of this answer, which gives two options, both of which look very inefficient:
- Use a `udf`. But I would like to avoid that, since udfs are inherently inefficient, especially if I have to apply one repeatedly for every categorical feature I want to one-hot encode. (A sketch of what this would look like follows the code example below.)
- Convert to an `rdd`, then for each row extract all the values and append the one-hot encoded features. You can find a good example of this here:
```python
def extract(row):
    # Keep every existing column value, then append one entry
    # per element of the one-hot vector.
    return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())

result = df.rdd.map(extract).toDF(allColNames)
```
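For completeness, the `udf` variant from the first bullet would look something like this (a rough sketch of the approach I'd like to avoid; `vector_element` is a hypothetical helper, not a library function):

```python
import pyspark.sql.functions as sf
from pyspark.sql.types import IntegerType

def vector_element(index):
    # Wraps element access on an ml Vector in a udf; one udf call per new column.
    return sf.udf(lambda vec: int(vec[index]), IntegerType())

for i, category in enumerate(categories):
    df = df.withColumn(category, vector_element(i)(sf.col('c_idx_vec')))
```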
The problem with the `rdd` approach is that it extracts every value of every row and then converts that back into a dataframe. This sounds horrible if you need to do it for a dataframe with hundreds of features and millions of rows, especially if you need to do it dozens of times.

Is there another, more efficient way to convert a vector column into separate columns that I'm missing? If not, maybe it's better not to use `OneHotEncoder` at all, since I can't use the result efficiently. Which leads me to option 3.
Option 3
Option 3 would be to one-hot encode the features myself using `reduce` and `withColumn`. For example:
```python
from functools import reduce
import pyspark.sql.functions as sf

# .otherwise(0) makes non-matching rows 0 rather than null
df = reduce(
    lambda df, category: df.withColumn(
        category, sf.when(sf.col('categoricalFeature') == category, 1).otherwise(0)
    ),
    categories,
    df,
)
```
Where `categories` is of course the list of possible categories in the feature `categoricalFeature`.
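For reference, I would collect the category list with a `distinct()`, and a possible refinement (untested) would be to build all the binary columns in a single `select` rather than chaining `withColumn` once per category, to keep the query plan to one projection:

```python
import pyspark.sql.functions as sf

# Assumes the feature has modest cardinality, so the distinct values
# can be collected to the driver in one small job.
categories = [
    row[0] for row in df.select('categoricalFeature').distinct().collect()
]

# One projection for all categories instead of len(categories) withColumn calls.
df = df.select(
    '*',
    *[
        sf.when(sf.col('categoricalFeature') == category, 1)
          .otherwise(0)
          .alias(str(category))
        for category in categories
    ]
)
```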
Before I spend hours or days implementing this, waiting for results, debugging, etc., I wanted to ask whether anyone has experience with this. Is there an efficient way I'm missing? If not, which of the three options would probably be fastest?