
I've prepared code to read a huge CSV file (50 GB, 500k records with 30k columns) in chunks of 1000 records. There is a transformation involved to transpose the data into key/value pairs. This currently runs sequentially. Is there a way to process those chunks in parallel? Would PySpark help here? The final file will be moved to Redshift.

import time

import boto3
import numpy as np
import pandas as pd

s3 = boto3.client('s3')

# Read the header row once, then reopen the object: the first read consumes
# the streaming S3 body.
obj = s3.get_object(Bucket='bucket', Key='path')
csv_header = pd.read_csv(obj['Body'], nrows=1).columns

obj = s3.get_object(Bucket='bucket', Key='path')
csv_iterator = pd.read_csv(obj['Body'], iterator=True, chunksize=1000, low_memory=False)

i = 0  # chunk counter
c = 0  # running record count

print("Iteration Start!", time.strftime("%H:%M:%S", time.localtime()))
for csv_chunk in csv_iterator:
    # Pad chunks that come back narrower than the header, then realign the
    # chunk to the full header.
    column_size = csv_header.size - csv_chunk.columns.size
    csv_chunk[np.arange(column_size)] = None
    chunk = pd.DataFrame(csv_chunk.values, columns=csv_header)

    # Append the widened chunk to a raw copy; write the header only once.
    if i == 0:
        chunk.to_csv('temp_source.csv', mode='a')
    else:
        chunk.to_csv('temp_source.csv', mode='a', header=False)

    # Unpivot to key/value pairs: one row per (eid, column, value).
    out = pd.melt(chunk,
                  id_vars=['eid'],
                  value_vars=chunk.columns[1:])
    if i == 0:
        out.to_csv('temp_key_value.csv', mode='a')
    else:
        out.to_csv('temp_key_value.csv', mode='a', header=False)

    i = i + 1
    c = c + csv_chunk.shape[0]
    print("Iteration ", i, " Chunk - ", c, " - ", time.strftime("%H:%M:%S", time.localtime()))

print("Load Complete!", time.strftime("%H:%M:%S", time.localtime()))

I'm trying to run the CSV read and the transformation in parallel.
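
One direction I'm considering (a minimal sketch, not tested on the real file): keep reading the S3 stream sequentially in the main process, since a single streamed body can't easily be read in parallel, and hand each chunk to a process pool that melts it and writes its own numbered part file. The pool size, the in-flight limit and the part-file names below are illustrative, and pickling each 1000 x 30k chunk over to a worker has overhead that would need measuring.

from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, as_completed, wait

import boto3
import pandas as pd


def melt_chunk(i, chunk):
    """Melt one chunk to key/value pairs and write it to its own part file."""
    out = pd.melt(chunk, id_vars=['eid'], value_vars=chunk.columns[1:])
    # One output file per chunk avoids the workers contending for a single file;
    # the parts can be concatenated (or loaded as-is) afterwards.
    out.to_csv(f'temp_key_value_part{i:05d}.csv', index=False)
    return i


if __name__ == '__main__':
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='path')
    csv_iterator = pd.read_csv(obj['Body'], iterator=True, chunksize=1000,
                               low_memory=False)

    max_in_flight = 8  # keep only a few chunks in memory at once
    with ProcessPoolExecutor(max_workers=4) as pool:
        pending = set()
        for i, chunk in enumerate(csv_iterator):
            pending.add(pool.submit(melt_chunk, i, chunk))
            if len(pending) >= max_in_flight:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    print('finished chunk', fut.result())
        for fut in as_completed(pending):
            print('finished chunk', fut.result())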

  • Can you share a 1000-line CSV sample of the actual input file for testing (a link)? – RomanPerekhrest Sep 01 '23 at 12:55
  • I'm afraid I can't share the file. The dimensions of the file are 500k x 30k. – Will Graham Sep 01 '23 at 13:26
  • This is a follow-up from https://stackoverflow.com/q/77017050/12846804 – OCa Sep 01 '23 at 13:28
  • "Trying to run read csv and transformations in parallel" Why? Might transformation not be done more efficiently in a post-processing stage, using vectorization? – OCa Sep 01 '23 at 13:30
  • @WillGraham, can you just `pd.read_csv(obj['Body'], nrows=1000).to_csv('test.csv')` for that? – RomanPerekhrest Sep 01 '23 at 13:30
  • `if i==0:` (...) `if i==0:` Do you have the exact same block twice? – OCa Sep 01 '23 at 13:31
  • I would probably use `split` to split the original CSV file to 1000-line chunks, then run the program over them with e.g. `parallel`... – AKX Sep 01 '23 at 13:32
  • (However, all in all this sounds like a "you don't need Pandas" problem – look at the `csv` module since you're essentially working with streaming...) – AKX Sep 01 '23 at 13:33
  • Not exactly focused on your question, but a downstream issue: reading a single 50 GB file into Redshift will be suboptimal (slow). Redshift can read CSV files from S3 in parallel, which greatly improves load time. Aim for at least one file per slice in the cluster and at least 250 MB per file; fifty 1 GB files would be great. – Bill Weiner Sep 01 '23 at 14:24
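
Following the last comment about splitting the load into many files, and the PySpark part of the question: a rough sketch of the same melt in PySpark, assuming Spark 3.4+ (which added DataFrame.unpivot), that the cluster can read the object straight from S3 (on EMR the s3:// scheme works, elsewhere s3a:// is usually needed), and placeholder bucket/prefix names. Writing roughly 50 gzipped part files lines up with the advice above and lets Redshift COPY the whole output prefix in parallel.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_key_value").getOrCreate()

# 500k x 30k CSV; keep everything as strings since it is all unpivoted anyway.
df = spark.read.csv("s3://bucket/path", header=True, inferSchema=False)

value_cols = [c for c in df.columns if c != "eid"]

# Unpivot to (eid, variable, value) rows - the Spark counterpart of pd.melt.
kv = df.unpivot(ids=["eid"], values=value_cols,
                variableColumnName="variable", valueColumnName="value")

# ~50 output files of roughly 1 GB each, as suggested in the comments,
# so Redshift can COPY the whole prefix in parallel, e.g.:
#   COPY key_value_table FROM 's3://bucket/key_value_parts/'
#   IAM_ROLE '...' CSV GZIP IGNOREHEADER 1;
(kv.repartition(50)
   .write.mode("overwrite")
   .option("compression", "gzip")
   .option("header", True)
   .csv("s3://bucket/key_value_parts/"))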

0 Answers