0

This is how I am reading and creating the dataframe with pandas

def get_sheet_data(sheet_name='SomeName'):
    df = pd.read_excel(f'{full_q_name}',
                       sheet_name=sheet_name,
                       header=[0,1],
                       index_col=0)#.fillna(method='ffill')
    df = df.swapaxes(axis1="index", axis2="columns")
    return df.set_index('Product Code')

printing this tabularized gives me(this potentially will have hundreds of columns): enter image description here

I cant seem to add those first two rows into the header, I've tried:

python:pandas - How to combine first two rows of pandas dataframe to dataframe header?

https://stackoverflow.com/questions/59837241/combine-first-row-and-header-with-pandas

and I'm failing at each point. I think its because of the multiindex, not necessarily the axis swap? But using:

https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html

is kind of going over my head right now. Please help me add those two rows into the header?

The output of df.columns is massive so Ive cut it down alot:

Index(['Product Code','Product Narrative\nHigh-level service description','Product Name','Huawei Product ID','Type','Bill Cycle Alignment',nan,'Stackable',nan,

and ends with:

nan], dtype='object')
Rafa
  • 564
  • 4
  • 12
imp
  • 435
  • 6
  • 20

1 Answers1

1

We Create new column names and set them to df.columns, the new column names are generated by joining the 3 Multindex headers and the 1st row of the DataFrame.

df.columns = ['_'.join(i) for i in zip(df.columns.get_level_values(0).tolist(), df.columns.get_level_values(1).tolist(), df.iloc[0,:].replace(np.nan,'').tolist())]
darth baba
  • 1,277
  • 5
  • 13
  • I attempted something similar before: df.columns = (df.columns.get_level_values(1)[:3].tolist() + df.columns.get_level_values(2)[3:].tolist()) ... But in both cases with your example above the error i get is: IndexError: Too many levels: Index has only 1 level, not 2 – imp Nov 22 '21 at 23:22
  • if the error says the Index has only 1 level then .get_level_values(0) and .get_level_values(1) should work, see the updated answer – darth baba Nov 23 '21 at 05:16
  • unfortunately I get the same error darth -->df = df.swapaxes(axis1="index", axis2="columns") is causing the error, if I comment the line out the process works but then obviously my data is not displaying correctly. And if I replace columns with index above so `df.index.get_level_values(0).tolist()`, i get an error of Length mismatch: Expected axis has 63 elements, new values have 33 elements – imp Nov 23 '21 at 06:09
  • i actually changed the whole line to be ` df.index = ['_'.join(i) for i in zip(df.index.get_level_values(0).tolist(), df.index.get_level_values(1).tolist(),df.iloc[0, :].replace(np.nan, '').tolist())]` and i got `IndexError: Too many levels: Index has only 1 level, not 2` – imp Nov 23 '21 at 06:29
  • why are you replacing df.columns with df.index when what you want is to rename columns – darth baba Nov 23 '21 at 07:01
  • literally playing around since i swapped the index i just wanted to see if it would work lol – imp Nov 23 '21 at 07:05
  • From your output to the df.columns, I don't think you have multiindex headers, so a simple ```['_'.join(i) for i in zip(df.columns, df.iloc[0,:].replace(np.nan,'').tolist())]``` should work – darth baba Nov 23 '21 at 07:11
  • If you had multiindex headers this [answer](https://stackoverflow.com/questions/14507794/pandas-how-to-flatten-a-hierarchical-index-in-columns) should help – darth baba Nov 23 '21 at 07:13
  • Well that got rid of the error and caused a new one: `TypeError: sequence item 0: expected str instance, float found` - will go through that link you sent me now (i havent read that post) and play with it further - thank you darth, will let you know the outcome as soon as i know :) – imp Nov 23 '21 at 07:17
  • 1
    This Type Error is caused because you have nan in the columns, replace those with any str or convert float NaNs to strings then try again it should work – darth baba Nov 23 '21 at 07:22
  • 1
    i simply added `df.columns = df.columns.astype(str)`, then your line `df.columns = ['_'.join(i) for i in zip(df.columns, df.iloc[0, :].replace(np.nan, '').tolist())]` and i'm returning the column names as expected :) thank you darth baba – imp Nov 23 '21 at 07:43