
I've prepared code to read a huge CSV file (50 GB, 500k records with 30k columns) in chunks of 1000 records. There is a transformation involved to transpose the data into key/value pairs. This currently runs sequentially. Is there a way to process those chunks in parallel? Would PySpark help here? The final file will be moved to Redshift.

import time

import boto3
import numpy as np
import pandas as pd

s3 = boto3.client('s3')

# Read the header row once, then reopen the object: the first read consumes
# the streaming S3 body.
obj = s3.get_object(Bucket='bucket', Key='path')
csv_header = pd.read_csv(obj['Body'], nrows=1).columns

obj = s3.get_object(Bucket='bucket', Key='path')
csv_iterator = pd.read_csv(obj['Body'], iterator=True, chunksize=1000, low_memory=False)

i = 0  # chunk counter
c = 0  # running record count

print("Iteration Start!", time.strftime("%H:%M:%S", time.localtime()))
for csv_chunk in csv_iterator:
    # Pad chunks that come back narrower than the header, then realign the
    # chunk to the full header.
    column_size = csv_header.size - csv_chunk.columns.size
    csv_chunk[np.arange(column_size)] = None
    chunk = pd.DataFrame(csv_chunk.values, columns=csv_header)

    # Append the widened chunk to a raw copy; write the header only once.
    if i == 0:
        chunk.to_csv('temp_source.csv', mode='a')
    else:
        chunk.to_csv('temp_source.csv', mode='a', header=False)

    # Unpivot to key/value pairs: one row per (eid, column, value).
    out = pd.melt(chunk,
                  id_vars=['eid'],
                  value_vars=chunk.columns[1:])
    if i == 0:
        out.to_csv('temp_key_value.csv', mode='a')
    else:
        out.to_csv('temp_key_value.csv', mode='a', header=False)

    i = i + 1
    c = c + csv_chunk.shape[0]
    print("Iteration ", i, " Chunk - ", c, " - ", time.strftime("%H:%M:%S", time.localtime()))

print("Load Complete!", time.strftime("%H:%M:%S", time.localtime()))

I'm trying to run the CSV read and the transformation in parallel.
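
One direction I'm considering (a minimal sketch, not tested on the real file): keep reading the S3 stream sequentially in the main process, since a single streamed body can't easily be read in parallel, and hand each chunk to a process pool that melts it and writes its own numbered part file. The pool size, the in-flight limit and the part-file names below are illustrative, and pickling each 1000 x 30k chunk over to a worker has overhead that would need measuring.

from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, as_completed, wait

import boto3
import pandas as pd


def melt_chunk(i, chunk):
    """Melt one chunk to key/value pairs and write it to its own part file."""
    out = pd.melt(chunk, id_vars=['eid'], value_vars=chunk.columns[1:])
    # One output file per chunk avoids the workers contending for a single file;
    # the parts can be concatenated (or loaded as-is) afterwards.
    out.to_csv(f'temp_key_value_part{i:05d}.csv', index=False)
    return i


if __name__ == '__main__':
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='path')
    csv_iterator = pd.read_csv(obj['Body'], iterator=True, chunksize=1000,
                               low_memory=False)

    max_in_flight = 8  # keep only a few chunks in memory at once
    with ProcessPoolExecutor(max_workers=4) as pool:
        pending = set()
        for i, chunk in enumerate(csv_iterator):
            pending.add(pool.submit(melt_chunk, i, chunk))
            if len(pending) >= max_in_flight:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    print('finished chunk', fut.result())
        for fut in as_completed(pending):
            print('finished chunk', fut.result())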

  • Can you share a 1000-line CSV sample of the actual input file for testing (a link)? – RomanPerekhrest Sep 01 '23 at 12:55
  • I'm afraid I can't share the file. The dimensions of the file are 500k x 30k. – Will Graham Sep 01 '23 at 13:26
  • This is a follow-up from https://stackoverflow.com/q/77017050/12846804 – OCa Sep 01 '23 at 13:28
  • "Trying to run read csv and transformations in parallel" Why? Might transformation not be done more efficiently in a post-processing stage, using vectorization? – OCa Sep 01 '23 at 13:30
  • @WillGraham, can you just `pd.read_csv(obj['Body'], nrows=1000).to_csv('test.csv')` for that? – RomanPerekhrest Sep 01 '23 at 13:30
  • `if i==0:` (...) `if i==0:` Do you have the exact same block twice? – OCa Sep 01 '23 at 13:31
  • I would probably use `split` to split the original CSV file to 1000-line chunks, then run the program over them with e.g. `parallel`... – AKX Sep 01 '23 at 13:32
  • (However, all in all this sounds like a "you don't need Pandas" problem – look at the `csv` module since you're essentially working with streaming...) – AKX Sep 01 '23 at 13:33
  • Not exactly focused on your question, but a downstream issue: reading a single 50 GB file into Redshift will be suboptimal (slow). Redshift can read CSV files from S3 in parallel, which greatly improves load time. Aim for at least one file per slice in the cluster and at least 250 MB per file; fifty 1 GB files would be great. – Bill Weiner Sep 01 '23 at 14:24
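
Following the last comment about splitting the load into many files, and the PySpark part of the question: a rough sketch of the same melt in PySpark, assuming Spark 3.4+ (which added DataFrame.unpivot), that the cluster can read the object straight from S3 (on EMR the s3:// scheme works, elsewhere s3a:// is usually needed), and placeholder bucket/prefix names. Writing roughly 50 gzipped part files lines up with the advice above and lets Redshift COPY the whole output prefix in parallel.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_key_value").getOrCreate()

# 500k x 30k CSV; keep everything as strings since it is all unpivoted anyway.
df = spark.read.csv("s3://bucket/path", header=True, inferSchema=False)

value_cols = [c for c in df.columns if c != "eid"]

# Unpivot to (eid, variable, value) rows - the Spark counterpart of pd.melt.
kv = df.unpivot(ids=["eid"], values=value_cols,
                variableColumnName="variable", valueColumnName="value")

# ~50 output files of roughly 1 GB each, as suggested in the comments,
# so Redshift can COPY the whole prefix in parallel, e.g.:
#   COPY key_value_table FROM 's3://bucket/key_value_parts/'
#   IAM_ROLE '...' CSV GZIP IGNOREHEADER 1;
(kv.repartition(50)
   .write.mode("overwrite")
   .option("compression", "gzip")
   .option("header", True)
   .csv("s3://bucket/key_value_parts/"))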

0 Answers