
I am trying to add Binary Encoding features to a PySpark dataframe and would like to know the fastest way to do so.

For example, given a DataFrame with columns {a, b, c}, I would like to create new columns {is_a_string, is_a_float, ...}, where each new column's value would be 1.0 or 0.0 depending on the datatype of the value in column a.

So far, I have tried UDFs. They work fine but are pretty slow. This seems like a simple task that I should be able to do with a built-in Spark function, but I can't find how to do so.

An example would be:

A table might look like

         a  | b  | c
    r1 | 1  | "" | NULL
    r2 | "" | "" | 1

We want to turn that into this:

         a  | b  | c    | is_a_int | is_a_string | is_a_null
    r1 | 1  | "" | NULL | 1.0      | 0.0         | 0.0
    r2 | "" | "" | 1    | 0.0      | 1.0         | 0.0

with is_b_int, is_b_string, etc., also added as new columns.

Lowblow
  • loop over your columns and use the information from `df.dtypes` to populate your new columns. – pault May 29 '19 at 20:17
  • Please provide a small [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). Don't post code in the comments, [edit] your question instead. – pault May 29 '19 at 20:22
  • 1
    You can not have mixed type columns in a spark DataFrame. Your column `a` must be all integers or all strings. – pault May 29 '19 at 20:41
  • Interesting... I think I misunderstood my problem. What if I wanted to perform a simple check where instead of checking data_type, I have a new column is_a_positive, which takes a certain value 1 or 0 depending on the actual value of a? – Lowblow May 29 '19 at 20:46
  • I think you're looking for [Spark Equivalent of IF Then ELSE](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else) – pault May 29 '19 at 20:58

0 Answers