I've written code to read a huge CSV file (50 GB, 500k records with 30k columns) from S3 in chunks of 1,000 records. Each chunk also goes through a transformation that unpivots (melts) the data into key/value pairs. This currently runs sequentially. Is there a way to process those chunks in parallel? Would PySpark help here? The final file will be loaded into Redshift.
import time

import boto3
import numpy as np
import pandas as pd

s3 = boto3.client('s3')

# Read only the first row to capture the full column list
obj = s3.get_object(Bucket='bucket', Key='path')
csv_header = pd.read_csv(obj['Body'], nrows=1).columns

# Re-open the object to get a fresh stream for the chunked read
obj = s3.get_object(Bucket='bucket', Key='path')
csv_iterator = pd.read_csv(obj['Body'], iterator=True, chunksize=1000, low_memory=False)

i = 0
c = 0
print("Iteration Start!", time.strftime("%H:%M:%S", time.localtime()))
for csv_chunk in csv_iterator:
    # Pad the chunk with empty columns so it matches the full header width
    column_size = csv_header.size - csv_chunk.columns.size
    csv_chunk[np.arange(column_size)] = None
    chunk = pd.DataFrame(csv_chunk.values, columns=csv_header)

    # Append the wide chunk to a staging file; write the header only once
    chunk.to_csv('temp_source.csv', mode='a', header=(i == 0))

    # Unpivot the chunk into (eid, variable, value) key/value pairs
    out = pd.melt(chunk, id_vars=['eid'], value_vars=chunk.columns[1:])
    out.to_csv('temp_key_value.csv', mode='a', header=(i == 0))

    i = i + 1
    c = c + csv_chunk.shape[0]
    print("Iteration ", i, " Chunk - ", c, " - ", time.strftime("%H:%M:%S", time.localtime()))
print("Load Complete!", time.strftime("%H:%M:%S", time.localtime()))
I'm trying to run the chunked CSV reads and the transformations in parallel; a rough sketch of what I'm attempting is below.
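For context, this is a minimal sketch of the kind of parallelism I have in mind: keep the chunked read sequential (a single pd.read_csv iterator can't easily be split), but hand each chunk's melt and write off to a pool of worker processes. The process_chunk helper, the part-file naming, the worker count, and the local 'source.csv' path (standing in for the S3 stream above) are placeholders for illustration, not part of my working code:

from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait

import pandas as pd

def process_chunk(i, chunk):
    # Melt one chunk into (eid, variable, value) rows and write it to its own part file
    out = pd.melt(chunk, id_vars=['eid'], value_vars=chunk.columns[1:])
    out.to_csv(f'temp_key_value_part_{i:05d}.csv', index=False, header=False)
    return i, len(out)

if __name__ == '__main__':
    # Local path used here for simplicity instead of the S3 object stream
    csv_iterator = pd.read_csv('source.csv', iterator=True, chunksize=1000, low_memory=False)

    max_workers = 4
    pending = set()
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        for i, chunk in enumerate(csv_iterator):
            pending.add(pool.submit(process_chunk, i, chunk))
            # Cap the number of chunks in flight so the whole 50 GB file
            # is never held in memory at once
            if len(pending) >= max_workers:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
        wait(pending)

Each worker writes its own headerless part file, which I'm assuming could then be loaded into Redshift with a single COPY over the common prefix. I'm not sure whether this approach is sound or whether PySpark would handle the whole thing more cleanly.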