I'm trying to group the rows of a large CSV file (100M+ rows) into batches of 100 to send to a Lambda function. I have a workaround using SparkContext:
csv_file_rdd = sc.textFile(csv_file).collect()
count = 0
buffer = []
while count < len(csv_file_rdd):
    buffer.append(csv_file_rdd[count])
    count += 1
    if count % 100 == 0 or count == len(csv_file_rdd):
        # Send buffer to process
        print("Send:", buffer)
        # Clear buffer
        buffer = []
but there must be a more elegant solution. I've tried using SparkSession and mapPartitions, but I haven't been able to make it work.
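For reference, here is the kind of thing I was aiming for: a minimal sketch of the partition-based route, batching each partition's iterator into groups of 100 and invoking the Lambda with boto3. The function name "my-lambda" and the JSON payload format are placeholders, and I use foreachPartition (the side-effect variant of mapPartitions) so nothing has to be collected to the driver:

import json
from itertools import islice

import boto3


def send_partition(rows, batch_size=100):
    # Create the client inside the function so it isn't pickled to the workers
    client = boto3.client("lambda")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # Fire-and-forget invocation; "my-lambda" is a placeholder name
        client.invoke(
            FunctionName="my-lambda",
            InvocationType="Event",
            Payload=json.dumps(batch).encode("utf-8"),
        )


# Each partition is processed on its executor; no collect() of 100M+ rows
sc.textFile(csv_file).foreachPartition(send_partition)

Is something along these lines the right approach, or is there a more idiomatic way to batch rows for an external call?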