I have a dataframe with a very large number of columns (>30000).
I'm filling it with 1s and 0s based on the values in the first column, like this:
from pyspark.sql.functions import array_contains, when

for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))
However, this process takes a lot of time. Is there a way to do this more efficiently? My intuition is that the per-column processing could be parallelized.
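One direction I'm considering is building all of the 0/1 columns in a single select instead of chaining thousands of withColumn calls. This is only a rough, untested sketch (reusing list_of_column_names and list_column from above), and I don't know whether it is actually faster:

from pyspark.sql.functions import array_contains, col, when

# Build one 0/1 flag expression per column name, then apply them all in a single select
flag_columns = [
    when(array_contains(col('list_column'), c), 1).otherwise(0).alias(c)
    for c in list_of_column_names
]
df = df.select('list_column', *flag_columns)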
Edit:
Sample input data
+----------------+-----+-----+-----+
| list_column    | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] |     |     |     |
| ['Bar', 'Baz'] |     |     |     |
| ['Foo']        |     |     |     |
+----------------+-----+-----+-----+
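For reproducibility, here is a minimal sketch of how this sample input could be constructed (the SparkSession variable spark and the exact literal values are just illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Single array column; the 0/1 columns are added by the loop above
df = spark.createDataFrame(
    [(['Foo', 'Bak'],), (['Bar', 'Baz'],), (['Foo'],)],
    ['list_column'],
)
list_of_column_names = ['Foo', 'Bar', 'Baz']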