Reading excel files in pandas

Question

I am trying to read an Excel file using pandas but I am not sure if I am able to read the way I need.

My file is like this:

I am reading the file like this:

excel_file = pd.ExcelFile('MY_FILE')
df = excel_file.parse(sheet_name=0, header=1)

This way I am able to read but I am not sure from which group each variable belongs. In this case, for each column I need to know from which group they are coming from. Is there any way to do this?

Thank you!

jezrael · Accepted Answer · 2019-08-01T13:54:00.523

Here is possible specify first and second row in parameter header for MultiIndex in columns and index_col for index from first column in function read_excel:

df = pd.read_excel('file.xlsx', header=[0,1], index_col=[0], sheet_name=0)

Your solution should be changed with same parameters:

excel_file = pd.ExcelFile('file.xlsx')
df = excel_file.parse(header=[0,1], index_col=[0], sheet_name=0)

print (df)
CUSTOM NAME   g1      g2          
NAME           A    B  A    B    C
NAME 1       1.0  NaN  1  NaN  1.0
NAME 1       NaN  1.0  1  1.0  NaN

print (df.columns)
MultiIndex(levels=[['g1', 'g2'], ['A', 'B', 'C']],
           codes=[[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]],
           names=['CUSTOM NAME', 'NAME'])

print (df.index)
Index(['NAME 1', 'NAME 1'], dtype='object')

Filtering working with tuples for select columns of MultiIndex:

print (df[df[('g1', 'A')] == 1])
CUSTOM NAME   g1     g2         
NAME           A   B  A   B    C
NAME 1       1.0 NaN  1 NaN  1.0

More information in Select rows in pandas MultiIndex DataFrame, only remove loc, because MultiIndex in columns.

in this case do you know how could I filter for example for G1.A when value is equals to 1.0? Super thanks! — briba, Aug 01 '19 at 13:51

PythonSherpa · Answer 2 · 2019-08-01T13:36:32.293

2

You can use MultiIndex if you pass a list of integers to header:

excel_file = pd.ExcelFile('example.xlsx')
df = excel_file.parse(sheet_name=0, header=[0,1])

the dataframe:

CUSTOM NAME     GROUP 1     GROUP 2
NAME            A   B       A   B   C
NAME 1          1.0 NaN     1   NaN 1.0
NAME 2          NaN 1.0     1   1.0 NaN

Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

edited Aug 01 '19 at 13:36

answered Aug 01 '19 at 13:29

PythonSherpa

2,560
3
19
40

What is your pandas version? Because in pandas 0.24.2 is necessary solution like in my answer, your solution create MultiIndex also from first column, instead index. – jezrael Aug 01 '19 at 13:35

Valdi_Bo · Answer 3 · 2019-08-01T13:51:48.787

Reading column names only from row 1 will probably lead to repeating column names, so such a call can end up with execution error.

It looks like you have a case of MultiIndex on columns:

GROUP 1 and GROUP 2 is the top level,
A, B and C is the subordinate level.

When reading Excel files, I usually use pd.read_excel method.

In your case, you probably should pass header=[0,1] parameter, specifying which rows contain column names.

So try something like:

df = pd.read_excel('MY_FILE.xls', header=[0,1])

You can also omit sheet_name=0, as this is the default value (read from the first sheet, sheet numbering is from 0).

I'm also not sure whether you can omit file extension (either xls or xlsx), so in the above proposal I wrote it explicitely.

Reading excel files in pandas

3 Answers3