Pyspark - iterate on a big dataframe

Question

I'm using the following code

    events_df = []
    for i in df.collect():
        v = generate_event(i)
        events_df.append(v)
  
    events_df = spark.createDataFrame(events_df, schema)

to go over each dataframe item and add an event header calculated in the generate_event function

def generate_event(delta_row):
    
    header = {
        "id": 1,
        ...
    }

    row = Row(Data=delta_row)
    return EntityEvent(header, row)


class EntityEvent:
    def __init__(self, _header, _payload):
        self.header = _header
        self.payload = _payload

It works fine locally for df with few items (even with 1 000 000 items) but when we have more than 6 millions the aws glue job fail

Note: with rdd seems to be better but I can't use it because I've a problem with dates < 1900-01-01 (issue)

is there a way to chunk the dataframe and consolidate at the end ?

what does `generate_event()` do exactly? there might be an easier way of doing that. — samkart, May 13 '22 at 06:23

Smaillns · Answer 1 · 2022-05-31T12:45:03.760

The best solution that we can preview is to use spark promise features, like adding new columns using struct and create_map functions...

events_df = (
        df
        .withColumn(
            "header",
            f.create_map(
                f.lit("id"),
                f.lit(1)
            )
        )
...

So we can create columns as much as we need and make transformations to get the required header structure

PS: this solution (add new columns to the dataframe rather than iterate on it) avoid using rdd and brings a big advantage in terms of performance !

Pyspark - iterate on a big dataframe

1 Answers1