
I want to one-hot encode multiple categorical features using pyspark (version 2.3.2). My data is very large (hundreds of features, millions of rows). The problem is that pyspark's OneHotEncoder class returns its result as one vector column. I need to have the result as a separate column per category. What is the most efficient way to achieve this?

Option 1 & 2

If you indeed use OneHotEncoder, the problem becomes how to transform that one vector column into separate binary columns. Searching around, most other answers seem to use some variant of this answer, which gives two options, both of which look very inefficient:

  1. Use a udf. But I would like to avoid that, since UDFs are inherently inefficient, especially if I have to use one repeatedly for every categorical feature I want to one-hot encode.
  2. Convert to rdd. Then for each row extract all the values and append the one-hot encoded features. You can find a good example of this here:
def extract(row):
    # Keep every existing column value, then append the dense entries
    # of the one-hot vector column (named 'c_idx_vec' in that answer).
    return tuple(row[field] for field in row.__fields__) + tuple(row.c_idx_vec.toArray().tolist())

result = df.rdd.map(extract).toDF(allColNames)

The problem is that this extracts every value of every row and then converts everything back into a dataframe. That sounds horrible for a dataframe with hundreds of features and millions of rows, especially if it has to be done dozens of times.
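To make concrete what the extract step in option 2 computes per row, here is a pure-Python sketch (no Spark involved; the column values and vector entries are made up for illustration):

```python
# Pure-Python sketch of what `extract` produces for a single row:
# the original column values, followed by the dense entries of the
# one-hot vector column.
row_values = ("a", 3)               # existing columns of the row
onehot_vector = [0.0, 1.0, 0.0]     # dense form of the vector column
flat_row = row_values + tuple(onehot_vector)
# flat_row == ("a", 3, 0.0, 1.0, 0.0)
```

This flattening happens for every row, which is why the rdd round-trip is so costly at scale.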

Is there another, more efficient way to convert a vector column to separate columns that I'm missing? If not maybe it's better not to use OneHotEncoder at all, since I can't use the result efficiently. Which leads me to option 3.

Option 3

Option 3 would be to just one-hot encode myself using reduce statements and withColumn. For example:

from functools import reduce
import pyspark.sql.functions as sf

df = reduce(
    lambda df, category: df.withColumn(
        category,
        sf.when(sf.col('categoricalFeature') == category, 1).otherwise(0)
    ),
    categories,
    df
)

Where categories is of course the list of possible categories in the feature categoricalFeature.
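For reference, the per-row result option 3 aims for can be sketched in plain Python (the category list and rows here are hypothetical):

```python
# Pure-Python sketch of the columns option 3 would add: one binary
# column per category, 1 where the row's value matches, 0 otherwise.
categories = ["red", "green", "blue"]        # hypothetical category list
rows = [{"categoricalFeature": "red"},
        {"categoricalFeature": "blue"}]
encoded = [
    {**row, **{c: int(row["categoricalFeature"] == c) for c in categories}}
    for row in rows
]
# encoded[0] == {"categoricalFeature": "red", "red": 1, "green": 0, "blue": 0}
```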

Before I spend hours or days implementing this, waiting for results, debugging, etc., I wanted to ask whether anyone has experience with this. Is there an efficient way I'm missing? If not, which of the three would probably be fastest?

Willem
  • Why don't you do `F.explode` on the vector column? – mck Nov 13 '20 at 17:47
  • If I'm not mistaken, `explode` only splits an array into rows, not columns? – Willem Nov 14 '20 at 22:25
  • Try this then: https://stackoverflow.com/questions/38384347/how-to-split-vector-into-columns-using-pyspark – mck Nov 15 '20 at 06:45
  • I already refer to that answer in my question including the reasons why I'm sceptical of implementing it for a huge dataframe. If these truly are the only options I might not even use `OneHotEncoder` and look for a more efficient way. The point of this question is to ask which of all these options is likely to be most efficient. – Willem Nov 15 '20 at 09:05

0 Answers