I'm trying to group the rows of a large CSV file (100M+ rows) into batches of 100 to send to a Lambda function. I have a workaround using SparkContext:
csv_file_rdd = sc.textFile(csv_file).collect()
count = 0
buffer = []
while count < len(csv_file_rdd):
    buffer.append(csv_file_rdd[count])
    count += 1
    if count % 100 == 0 or count == len(csv_file_rdd):
        # Send buffer to process
        print("Send:", buffer)
        # Clear buffer
        buffer = []
but there must be a more elegant solution. I've tried using SparkSession and mapPartitions, but I haven't been able to make it work.
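For reference, here is the kind of thing I was aiming for: a minimal sketch of the partition-based route, batching each partition's iterator into groups of 100 and invoking the Lambda with boto3. The function name "my-lambda" and the JSON payload format are placeholders, and I use foreachPartition (the side-effect variant of mapPartitions) so nothing has to be collected to the driver:

import json
from itertools import islice

import boto3


def send_partition(rows, batch_size=100):
    # Create the client inside the function so it isn't pickled to the workers
    client = boto3.client("lambda")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # Fire-and-forget invocation; "my-lambda" is a placeholder name
        client.invoke(
            FunctionName="my-lambda",
            InvocationType="Event",
            Payload=json.dumps(batch).encode("utf-8"),
        )


# Each partition is processed on its executor; no collect() of 100M+ rows
sc.textFile(csv_file).foreachPartition(send_partition)

Is something along these lines the right approach, or is there a more idiomatic way to batch rows for an external call?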