Get single column from dummy columns in PySpark

Question

I have a PySpark df like the following, where event, 2020-01 and 2020-02 are dummies (in total I have 18 month).

df = (
    sc.parallelize([
        ("A", 1, 0, 1), ("B", 0, 1, 0), ("C", 0, 1, 0),
        ("D", 1, 1, 1),
    ]).toDF(["id", "event", "2020-01", "2020-02"])
)

id  event   2020-01   2020-02
A   1       0         1
B   0       1         0
C   0       1         0
D   1       1         1

and I want to create a new df with id, event and month, where month is a column created from "2020-01" and "2020-02" where those columns == 1. So the desired df would look like the following:

id  event   month
A   1       2020-02
B   0       2020-01
C   0       2020-01
D   1       2020-01
D   1       2020-02

I found a pandas solution here, which does what I am looking for but my df is to big for pandas. I did not manage to get this solution to work. It just created each month for each ID.

score 1 · Answer 1 · answered Jan 25 '22 at 15:12

You can do that with the following

from spark.sql import functions

df1 = df.select("id", "event").where(df["2020-01"] == 1).withColumn("month", functions.lit("2020-01"))
df2 = df.select("id", "event").where(df["2020-02"] == 1).withColumn("month", functions.lit("2020-02"))

df1 = df1.unionAll(df2).orderBy("id")
df1.show()
+---+-----+-------+
| id|event|  month|
+---+-----+-------+
|  A|    1|2020-02|
|  B|    0|2020-01|
|  C|    0|2020-01|
|  D|    1|2020-01|
|  D|    1|2020-02|
+---+-----+-------+

Tanks for your solution. It looks quiet manual to me, with a lot of code for 18 month. Is there any more elegant solution? — TiTo, Jan 25 '22 at 15:34

score 1 · Accepted Answer · answered Jan 27 '22 at 10:50

This may work if you get many columns:

out_list = [i for i in df.columns if i not in ['id','event']]

stack_str = ','.join(map(lambda x:'"{0}",`{0}`'.format(x),out_list))

(df.selectExpr('id','event',
               'stack({0},{1}) as 
               (feature,value)'.format(len(out_list),stack_str))
 .where(col('value')==1)
 .show()[enter image description here][1]
)

Get single column from dummy columns in PySpark

2 Answers2