11

I would like to read an Excel sheet into Pandas DataFrame. However, there are merged Excel cells as well as Null rows (full/partial NaN filled), as shown below. To clarify, John H. has made an order to purchase all the albums from "The Bodyguard" to "Red Pill Blues".

Excel sheet capture

When I read this Excel sheet into a Pandas DataFrame, the Excel data does not get transferred correctly. Pandas considers a merged cell as one cell. The DataFrame looks like the following: (Note: Values in () are the desired values that I would like to have there)

Dataframe capture

Please note that the last row does not contain merged cells; it only carries a value for Artist column.


EDIT: I did try the following to forward-fill in the NaN values:(Pandas: Reading Excel with merged cells)
df.index = pd.Series(df.index).fillna(method='ffill')  

However, the NaN values remain. What strategy or method could I use to populate the DataFrame correctly? Is there a Pandas method of unmerging the cells and duplicating the corresponding contents?

CPU
  • 267
  • 1
  • 6
  • 16
  • 3
    Did you try anything? Can you show us your attempt? May this post can help: https://stackoverflow.com/questions/22937650/pandas-reading-excel-with-merged-cells – Vico Dec 15 '17 at 14:11
  • 1
    Possible duplicate of [Pandas: Reading Excel with merged cells](https://stackoverflow.com/questions/22937650/pandas-reading-excel-with-merged-cells) – John Y Dec 15 '17 at 14:21
  • 1
    @Vico - not just helpful, it's the exact same question! – John Y Dec 15 '17 at 14:21
  • @ Vico - I already tried the solution in the link, however, it does not work. The `NaN` values remain. – CPU Dec 15 '17 at 15:12

2 Answers2

11

The referenced link you attempted needed to forward fill only the index column. For your use case, you need to fillna for all dataframe columns. So, simply forward fill entire dataframe:

df = pd.read_excel("Input.xlsx")
print(df)

#    Order_ID Customer_name            Album_Name           Artist  Quantity
# 0       NaN           NaN            RadioShake              NaN       NaN
# 1       1.0       John H.         The Bodyguard  Whitney Houston       2.0
# 2       NaN           NaN              Lemonade          Beyonce       1.0
# 3       NaN           NaN  The Thrill Of It All        Sam Smith       2.0
# 4       NaN           NaN              Thriller  Michael Jackson      11.0
# 5       NaN           NaN                Divide       Ed Sheeran       4.0
# 6       NaN           NaN            Reputation     Taylor Swift       3.0
# 7       NaN           NaN        Red Pill Blues         Maroon 5       5.0

df = df.fillna(method='ffill')
print(df)

#    Order_ID Customer_name            Album_Name           Artist  Quantity
# 0       NaN           NaN            RadioShake              NaN       NaN
# 1       1.0       John H.         The Bodyguard  Whitney Houston       2.0
# 2       1.0       John H.              Lemonade          Beyonce       1.0
# 3       1.0       John H.  The Thrill Of It All        Sam Smith       2.0
# 4       1.0       John H.              Thriller  Michael Jackson      11.0
# 5       1.0       John H.                Divide       Ed Sheeran       4.0
# 6       1.0       John H.            Reputation     Taylor Swift       3.0
# 7       1.0       John H.        Red Pill Blues         Maroon 5       5.0
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

Using conditional:

import pandas as pd

df_excel = pd.ExcelFile('Sales.xlsx')
df = df_excel.parse('Info')

for col in list(df):  # All columns
    pprow = 0
    prow = 1
    for row in df[1:].iterrows():  # All rows, except first
        if pd.isnull(df.loc[prow, 'Album Name']):  # If this cell is empty all in the same row too.
            continue
        elif pd.isnull(df.loc[prow, col]) and pd.isnull(df.loc[row[0], col]):  # If a cell and next one are empty, take previous valor. 
            df.loc[prow, col] = df.loc[pprow, col]
        pprow = prow
        prow = row[0]

Output (I use different names):

    Order_ID Customer_name    Album Name
0        NaN           NaN         Radio
1        1.0          John            a 
2        1.0          John             b
3        1.0          John             c
4        1.0          John             d
5        1.0          John             e
6        1.0          John             f
7        NaN           NaN            GE
8        2.0         Harry   We are Born
9        3.0        Lizzy        Relapse
10       4.0           Abe         Smoke
11       4.0           Abe       Tell me
12       NaN           NaN           NaN
13       NaN           NaN      Best Buy
14       5.0        Kristy      The wall
15       6.0         Sammy  Kind of blue
Manuel
  • 698
  • 4
  • 8
  • @ Manuel - I tried your solution, but I get the following error: `KeyError: the label [-1] is not in the [index]`. This error is caused from the last line of your code. Shouldn't the last line be `df.loc[row[0], col] = df.loc[row[0]+1, col]`? It would be appreciative if you can include in-line comments in your answer to explain your logic. I am new to Python Pandas and would like to better understand your logic here. Thank you. – CPU Dec 15 '17 at 18:56
  • @ Manuel - How does your solution regard the last row of my Excel sheet (screenshot above) where it is a normal row with unmerged cells. It looks as though this row is merged with the previous row (`Order_ID 15`), but this last row is its own. – CPU Dec 15 '17 at 19:23
  • @ Manuel - I retried your solution, but I am getting the same error because my DataFrame has index labels that go like this: `2,3,4,5,6,7,8,9,11,12,...`. Thus, there is no row/index label `10`. I want to reset the index labels but for the project purposes, I cannot. How can I iterate a DataFrame (row-by-row) that has **non-sequential** index labels? – CPU Dec 15 '17 at 21:09
  • Now saves each row in 'prow', previous row and 'pprow' previous previous row. Now, there should be no problem even if the row numbers are not consecutive. – Manuel Dec 15 '17 at 22:05
  • @ Manuel - Your updated code partially works for me. I tried to implement your solution into a dictionary of DataFrames (`df_dict = {sheet_name: Dataframe, sheet_name: Dataframe,...}`). I had a combination of results: for some of the dataframes, the merged cells are correctly filled in entirely while some DF's are partially filled & some retain the merged cells. There should be way to do this in a dictionary of multiple DataFrames... – CPU Dec 15 '17 at 23:57
  • @Mook If all the dataframes have the same format, it should work. Test separately the cases that do not work. – Manuel Dec 16 '17 at 00:40