I'm using the following code
events_df = []
for i in df.collect():
    v = generate_event(i)
    events_df.append(v)
events_df = spark.createDataFrame(events_df, schema)
to iterate over each DataFrame row and add an event header computed by the generate_event function:
from pyspark.sql import Row

def generate_event(delta_row):
    header = {
        "id": 1,
        ...
    }
    row = Row(Data=delta_row)
    return EntityEvent(header, row)
class EntityEvent:
    def __init__(self, _header, _payload):
        self.header = _header
        self.payload = _payload
It works fine locally for a df with few rows (even with 1,000,000 rows), but when there are more than 6 million rows the AWS Glue job fails.
Note: the RDD approach seems to perform better, but I can't use it because I have a problem with dates earlier than 1900-01-01 (issue).
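For clarity, by "the RDD approach" I mean roughly the following (a sketch that reuses generate_event from above):

# map over the RDD instead of collecting to the driver;
# this avoids building the full Python list in driver memory
events_df = spark.createDataFrame(
    df.rdd.map(lambda r: generate_event(r)),
    schema
)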
Is there a way to chunk the DataFrame and consolidate the results at the end?
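To make the question concrete, something along these lines is what I have in mind (a rough sketch; n_chunks and the _chunk column name are arbitrary, and it still uses collect, just per chunk):

from functools import reduce
from pyspark.sql import functions as F

n_chunks = 10  # arbitrary number of chunks
salted = df.withColumn("_chunk", F.monotonically_increasing_id() % n_chunks)

chunk_dfs = []
for c in range(n_chunks):
    # collect only one chunk at a time to limit driver memory usage
    rows = salted.filter(F.col("_chunk") == c).drop("_chunk").collect()
    events = [generate_event(r) for r in rows]
    chunk_dfs.append(spark.createDataFrame(events, schema))

# consolidate all chunk DataFrames into one
events_df = reduce(lambda a, b: a.unionByName(b), chunk_dfs)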