I have a 50gb csv file(500k records, 30k columns) that I need to transpose to key value pairs. I planned to iterate through the file in chunks and for each chunk I would take a transpose and append to my target csv file. While iterating through each chunk, I need to make sure each chunk has all the column names so that the transpose works correctly. But the dataframe for each chunk varies in size(column count vary). I have been trying to fetch the column names and add it to each dataframe chunk after padding empty columns to it to make sure the size is consistent. Tried to concat dummy dataframe but got an error - NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
Any suggestions to resolve this?
obj = s3.get_object(Bucket='bucket', Key = 'xyz.csv')
csv_header = pd.read_csv(obj['Body'], nrows=1).columns
print(csv_header.size)
csv_iterator = pd.read_csv(obj['Body'], iterator=True, chunksize=1)
for csv_chunk in csv_iterator:
#chunk =csv_chunk.set_axis(csv_header,
# axis=1,inplace=True)
#chunk.to_csv('temp_raw.csv',mode = 'a')
#chunk.columns(csv_header)
#chunk = pd.concat([csv_chunk,csv_header],axis=1)
#print(csv_chunk)
#csv_chunk.columns = csv_header
column_size = csv_header.size-csv_chunk.columns.size
print(csv_chunk.columns.size)
print(csv_chunk.shape)
print(column_size)
a=[]
b=[]
for j in range(0, column_size):
b.append(None)
#Append the column to each row.
for i in range(0, 1):
a.append(b)
print(pd.DataFrame(a).shape)
df=pd.DataFrame(csv_chunk)
a=pd.DataFrame(a)
print(df.index)
print(a.index)
chunk=pd.concat([df,a],axis = 1)
# df = pd.concat([pd.DataFrame(csv_chunk),
# pd.DataFrame(a)],axis =1)
# chunk = pd.DataFrame(pd.concat([pd.DataFrame(csv_chunk),
# pd.DataFrame(a)]
# ,axis=1),columns=csv_header)
#print(pd.DataFrame(a).size)
# print(chunk.head(1))
# print(csv_chunk.columns.size)
chunk.to_csv('temp_source.csv',mode = 'a')
out = pd.melt(chunk,
id_vars=['eid'],
value_vars=chunk.columns[1:])
out.to_csv('temp_key_value.csv',mode='a')
Tried to concat dummy dataframe but got an error - NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
Any suggestions to resolve this?