
I have a 50 GB CSV file (500k records, 30k columns) that I need to transpose into key-value pairs. My plan is to iterate through the file in chunks, transpose each chunk, and append it to the target CSV file. While iterating, I need to make sure each chunk carries all of the column names so that the transpose works correctly, but the DataFrame for each chunk varies in size (the column count varies). I have been trying to fetch the full header and attach it to each chunk after padding the chunk with empty columns to keep the width consistent. When I tried to concat a dummy DataFrame, I got this error - NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.

Any suggestions to resolve this?


import boto3
import pandas as pd

s3 = boto3.client('s3')

obj = s3.get_object(Bucket='bucket', Key='xyz.csv')
csv_header = pd.read_csv(obj['Body'], nrows=1).columns
print(csv_header.size)

# Fetch the object again: the first read_csv consumed part of the
# streaming body, so a second read_csv on the same obj['Body'] would
# start mid-file.
obj = s3.get_object(Bucket='bucket', Key='xyz.csv')
csv_iterator = pd.read_csv(obj['Body'], iterator=True, chunksize=1)

for csv_chunk in csv_iterator:
    # Pad each chunk with empty columns so its width matches the header.
    column_size = csv_header.size - csv_chunk.columns.size
    print(csv_chunk.shape, column_size)
    pad = pd.DataFrame([[None] * column_size], index=csv_chunk.index)
    chunk = pd.concat([csv_chunk, pad], axis=1)   # <- raises NotImplementedError

    chunk.to_csv('temp_source.csv', mode='a')
    out = pd.melt(chunk,
                  id_vars=['eid'],
                  value_vars=chunk.columns[1:])
    out.to_csv('temp_key_value.csv', mode='a')
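
For scale, here is a self-contained miniature of what each loop iteration is meant to do, with made-up column names and values (with a plain header the concat works; the NotImplementedError only appears against the real file):

```python
import pandas as pd

# Made-up stand-ins for the real 30k-column header and a short chunk.
csv_header = pd.Index(['eid', 'c1', 'c2', 'c3'])
csv_chunk = pd.DataFrame({'eid': [101], 'c1': [5]})

# Pad with a dummy frame so widths match, then relabel to the full header.
missing = csv_header.size - csv_chunk.columns.size
pad = pd.DataFrame([[None] * missing], index=csv_chunk.index)
chunk = pd.concat([csv_chunk, pad], axis=1)
chunk.columns = csv_header

out = pd.melt(chunk, id_vars=['eid'], value_vars=chunk.columns[1:])
print(out)
```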


  • Welcome to SO! For accurate responses, please take the [tour](https://stackoverflow.com/tour). Importantly, see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). A few representative chunk lines should help. – OCa Aug 31 '23 at 16:18

1 Answer


Strictly regarding your question's title,

df[np.arange(4)] = None will add N = 4 columns to a pre-existing DataFrame df:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns = ['one','two'],
                  index = range(1))
df
   one  two
0  NaN  NaN

df[np.arange(4)]=None
df
   one  two     0     1     2     3
0  NaN  NaN  None  None  None  None

You might find other ways to your liking here: How to add a new column to an existing DataFrame?
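
One way that fits the chunk-padding use case specifically is DataFrame.reindex(columns=...), which adds every missing column in a single call - a minimal sketch, with made-up column names:

```python
import pandas as pd

full_header = pd.Index(['eid', 'c1', 'c2', 'c3'])   # made-up full header

chunk = pd.DataFrame({'eid': [101], 'c1': [5]})     # chunk missing c2, c3

# reindex pads every missing column with NaN and fixes the column order,
# so no dummy DataFrame (and no index union) is needed at all.
padded = chunk.reindex(columns=full_header)
print(padded.columns.tolist())   # ['eid', 'c1', 'c2', 'c3']
```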

Now, where your transpose and MultiIndex factor in is unclear, given the lack of a minimal reproducible example in your question. Reading from a CSV:

  • it isn't clear how you end up with a MultiIndex, which looks like the cause of your difficulty, and one that might be removed by not generating a MultiIndex in the first place. But we'd need to see a representative sample of the input data.
  • It is also unclear how the number of columns can vary between chunks when reading from the same CSV. Is this a real CSV, or a text file that only looks like one?
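
If a MultiIndex really is present on the chunk's columns, the error message's own suggestion applies: flatten it before any concat/union. A sketch of that, on a toy two-level index:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)])
df = pd.DataFrame([[10, 20]], columns=mi)

# Collapse each tuple of levels into a single string label.
df.columns = ['_'.join(map(str, t)) for t in df.columns.to_flat_index()]
print(df.columns.tolist())   # ['a_1', 'b_2']
```

After this, df has a plain Index of strings and concatenates with other flat-indexed frames without the union error.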
– OCa