59

I have Excel files with multiple sheets, each of which looks a little like this (but much longer):

        Sample  CD4     CD8
Day 1   8311    17.3    6.44
        8312    13.6    3.50
        8321    19.8    5.88
        8322    13.5    4.09
Day 2   8311    16.0    4.92
        8312    5.67    2.28
        8321    13.0    4.34
        8322    10.6    1.95

The first column is actually four cells merged vertically.

When I read this using pandas.read_excel, I get a DataFrame that looks like this:

       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95

How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)

iayork
  • 6,420
  • 8
  • 44
  • 49

3 Answers3

78

You could use the Series.fillna method to forword-fill in the NaN values:

df.index = pd.Series(df.index).fillna(method='ffill')

For example,

In [42]: df
Out[42]: 
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95

[8 rows x 3 columns]

In [43]: df.index = pd.Series(df.index).fillna(method='ffill')

In [44]: df
Out[44]: 
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
Day 1    8312  13.60  3.50
Day 1    8321  19.80  5.88
Day 1    8322  13.50  4.09
Day 2    8311  16.00  4.92
Day 2    8312   5.67  2.28
Day 2    8321  13.00  4.34
Day 2    8322  10.60  1.95

[8 rows x 3 columns]
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 1
    How would you solve the same problem for merged columns instead of rows? – Samarth Bharadwaj Dec 11 '14 at 09:42
  • 9
    @SamarthBharadwaj: The [`fillna` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) has an `axis` parameter which controls the direction to be filled. To fill all the NaNs in a DataFrame row-wise, you could use `df = df.fillna(method='ffill', axis=1)`. To fill only selected rows, use `df.loc` or `df.iloc`. For example, `df.loc[mask] = df.loc[mask].fillna(method='ffill', axis=1)`. – unutbu Dec 11 '14 at 09:48
  • @unutbu thx, but my question is slightly different, expressed here: http://stackoverflow.com/questions/27420263/pandas-parse-merged-header-columns-from-excel – Samarth Bharadwaj Dec 11 '14 at 10:15
  • also worked when the problematic column was not the index – wander95 Jan 29 '19 at 20:52
  • 11
    `fillna` with `ffill` is ok as long as a merged cell is not followed by a volontarily empty cell... – PlasmaBinturong Feb 24 '19 at 13:30
13
df = df.fillna(method='ffill', axis=0)  # resolved updating the missing row entries
Tim
  • 2,510
  • 1
  • 22
  • 26
Muth
  • 131
  • 1
  • 2
  • 2
    Code-only answers are generally frowned upon on Stack Overflow. In order to avoid being closed as 'low quality', please add some explanatory text. – Adrian Mole Oct 19 '19 at 19:00
12

To casually come back 8 years later, pandas.read_excel() can solve this internally for you with the index_col parameter.

df = pd.read_excel('path_to_file.xlsx', index_col=[0])

Passing index_col as a list will cause pandas to look for a MultiIndex. In the case where there is a list of length one, pandas creates a regular Index filling in the data.

Nathan Pyle
  • 481
  • 6
  • 9