I am trying to add binary-encoding features to a PySpark DataFrame and would like to know the fastest way to do so.
For example, given a DataFrame with columns {a, b, c}, I would like to create new columns {is_a_string, is_a_float, ...}, where each new column holds 1.0 or 0.0 depending on the datatype of the value in column a.
So far, I have tried UDFs. They work fine but are pretty slow. This seems like a simple task that I should be able to do with a built-in Spark function, but I can't find how to do so.
An example: the input table might look like this:
   | a  | b  | c
r1 | 1  | "" | NULL
r2 | "" | "" | 1
We want to turn that into this:
   | a  | b  | c    | is_a_int | is_a_string | is_a_null
r1 | 1  | "" | NULL | 1.0      | 0.0         | 0.0
r2 | "" | "" | 1    | 0.0      | 1.0         | 0.0
with is_b_int, is_b_string, etc. (and the same for c) also added as new columns.
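
For reference, here is a minimal sketch of how the sample table above can be built and roughly what my current UDF approach looks like. The helper names are just illustrative, and I'm assuming the columns are string-typed, so the "type" of a value has to be inferred from its contents:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Sample data from the tables above; the columns are string-typed, so NULL is
# represented as None and the "int" 1 as the string "1".
df = spark.createDataFrame(
    [("1", "", None), ("", "", "1")],
    schema="a string, b string, c string",
)

def is_int(value):
    """1.0 if the value parses as an integer, else 0.0."""
    if value is None:
        return 0.0
    try:
        int(value)
        return 1.0
    except (TypeError, ValueError):
        return 0.0

def is_null(value):
    """1.0 if the value is null, else 0.0."""
    return 1.0 if value is None else 0.0

def is_string(value):
    """1.0 if the value is non-null and does not parse as an integer, else 0.0."""
    if value is None:
        return 0.0
    return 0.0 if is_int(value) == 1.0 else 1.0

is_int_udf = F.udf(is_int, DoubleType())
is_null_udf = F.udf(is_null, DoubleType())
is_string_udf = F.udf(is_string, DoubleType())

# Add the indicator columns for every original column.
for c in ["a", "b", "c"]:
    df = (
        df.withColumn(f"is_{c}_int", is_int_udf(F.col(c)))
          .withColumn(f"is_{c}_string", is_string_udf(F.col(c)))
          .withColumn(f"is_{c}_null", is_null_udf(F.col(c)))
    )

df.show()
```

This produces the output above, but the Python UDF round-trip is what makes it slow on larger data.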