
I have a piece of code that merges columns with duplicated column names in a pandas DataFrame; basically I am trying to do the same thing as in this post:

Python Pandas merge samed name columns in a dataframe

However, the DataFrame I am trying to process is loaded from a CSV of around 1 GB, with around 2,600 columns and 27,000+ rows.

The code runs, but it takes about 2 hours 20 minutes to finish.

Out of the 2,600 columns, only ~30 need to be merged into 4 columns, say the 13th through 42nd.

Is there a way to optimize the code mentioned in the linked post? Perhaps there is a way to tell pandas to group only the 13th through 42nd columns and join only the fields in that range.
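
In other words, something along these lines on a toy frame (the column names, positions, and the `.T.groupby(level=0).first().T` merge here are just an illustration of the idea, not my actual code):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the big CSV: duplicated "a"/"b" columns in the middle,
# unique columns around them (positions are illustrative only).
df = pd.DataFrame(
    [[1, "x", np.nan, "y", np.nan, 9],
     [2, np.nan, "z", np.nan, "w", 8]],
    columns=["id", "a", "a", "b", "b", "extra"],
)

# Slice out only the block of duplicated columns by position...
dup_block = df.iloc[:, 1:5]

# ...merge duplicates within that block (first non-null value per name)...
merged = dup_block.T.groupby(level=0).first().T

# ...and glue the untouched columns back on, so the whole-frame groupby is skipped.
result = pd.concat([df.iloc[:, :1], merged, df.iloc[:, 5:]], axis=1)
```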

Greatly appreciated.

Paul
  • The user guide shows how to slice columns with labels or with list-like integer indexing using `.loc[]` and `.iloc[]`: https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-label – RichieV Jul 30 '20 at 07:06
  • Take a look at `.melt()` (https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-by-melt). After melting the target columns into variable & value columns and `.dropna(subset=['value'])`, you can `.groupby(['variable']).apply(';'.join)` as in this answer: https://stackoverflow.com/a/27298308/6692898 – RichieV Jul 30 '20 at 07:18
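
A rough sketch of what the comments suggest, on a toy frame with a duplicated column `a` (the names and the `;` separator are made up for illustration, and the groupby here also keys on `id` to keep rows separate):

```python
import numpy as np
import pandas as pd

# Two columns that share the name "a"; row with id=3 has values in both copies.
df = pd.DataFrame(
    [[1, "x", np.nan],
     [2, np.nan, "y"],
     [3, "p", "q"]],
    columns=["id", "a", "a"],
)

# Melt the duplicated columns into variable/value pairs, drop empty values,
# then join the remaining values per row and per column name.
long = df.melt(id_vars=["id"]).dropna(subset=["value"])
joined = (long.groupby(["id", "variable"])["value"]
              .apply(";".join)
              .unstack())
```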

1 Answer


Thanks to RichieV for both tips. I ended up using both and doing the following:

First, split your dataset into two parts by columns. I split at column 25 because that's where the repeating columns with NaN values end:

    sec1 = filtered_data.iloc[:,:25]
    sec2 = filtered_data.iloc[:,25:]

Then melt the first part of the DataFrame into variable & value columns while preserving `header_cols`; this is faster than melting the whole DataFrame:

    sec1 = sec1.melt(id_vars=header_cols)

Next, drop any rows that have empty values in any column (mainly this will be the value column):

    sec1 = sec1.dropna(axis='index', how='any') #remove rows with empty fields

Finally, pivot it back and join the two split DataFrames:

    sec1 = sec1.pivot_table(index=header_cols,
                            columns='variable',
                            values='value',
                            aggfunc='first')
    sec1 = sec1.reset_index()
    sec1 = sec1.rename_axis(None, axis=1)
    filtered_data = pd.concat([sec1, sec2], axis=1)  # pd: the usual pandas alias
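
Putting the steps above together on a tiny made-up frame (names like `dup`/`keep` and the split point are invented for the demo; in the real data they come from the CSV):

```python
import numpy as np
import pandas as pd

header_cols = ["id"]

# Toy stand-in: duplicated "dup" columns with complementary NaNs,
# plus a "keep" column that should pass through untouched.
filtered_data = pd.DataFrame(
    [[1, "x", np.nan, 10],
     [2, np.nan, "y", 20]],
    columns=["id", "dup", "dup", "keep"],
)

sec1 = filtered_data.iloc[:, :3]   # part with the duplicated columns
sec2 = filtered_data.iloc[:, 3:]   # untouched part

sec1 = sec1.melt(id_vars=header_cols)
sec1 = sec1.dropna(axis="index", how="any")
sec1 = sec1.pivot_table(index=header_cols, columns="variable",
                        values="value", aggfunc="first")
sec1 = sec1.reset_index().rename_axis(None, axis=1)

result = pd.concat([sec1, sec2], axis=1)
```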

Thanks RichieV once again for the great tips!

Paul