I have a dataframe with a very large number of columns (>30000).
I'm filling it with 1s and 0s based on the values in the first column, like this:
from pyspark.sql.functions import array_contains, when

for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))
However, this process takes a lot of time. Is there a way to do this more efficiently? My intuition is that the per-column processing could be parallelized.
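One direction I'm considering is building all of the 0/1 columns in a single select instead of chaining thousands of withColumn calls. This is only a rough, untested sketch (reusing list_of_column_names and list_column from above), and I don't know whether it is actually faster:

from pyspark.sql.functions import array_contains, col, when

# Build one 0/1 flag expression per column name, then apply them all in a single select
flag_columns = [
    when(array_contains(col('list_column'), c), 1).otherwise(0).alias(c)
    for c in list_of_column_names
]
df = df.select('list_column', *flag_columns)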
Edit:
Sample input data
+----------------+-----+-----+-----+
| list_column    | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] |     |     |     |
| ['Bar', 'Baz'] |     |     |     |
| ['Foo']        |     |     |     |
+----------------+-----+-----+-----+
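For reproducibility, here is a minimal sketch of how this sample input could be constructed (the SparkSession variable spark and the exact literal values are just illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Single array column; the 0/1 columns are added by the loop above
df = spark.createDataFrame(
    [(['Foo', 'Bak'],), (['Bar', 'Baz'],), (['Foo'],)],
    ['list_column'],
)
list_of_column_names = ['Foo', 'Bar', 'Baz']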