
I have a dataframe df:

import numpy as np
import pandas as pd

df = pd.DataFrame(
{
    "type": ["E/2", "E/2", "E/2", "E/2"],
    "subtype1": ["N2", "N2", "N2", "N2"],
    "subtype2": ["a", "a", "b", np.nan],
    "subtype3": [np.nan, np.nan, np.nan, "xxx"],
    "flex_best": [20, np.nan, 20, np.nan],
    "flex_worst": [np.nan, 30, np.nan, 30],
    "lead_best": [23, np.nan, 23, np.nan],
    "is_best": [1, np.nan, 1, np.nan],
    "lead_worst": [np.nan, 33, np.nan, 33],
    "is_worst": [np.nan, 1, np.nan, 1],
}
)
df.head()

| type | subtype1 | subtype2 | subtype3 | flex_best | flex_worst | lead_best | is_best | lead_worst | is_worst|
|------|----------|----------|----------|-----------|------------|-----------|---------|------------|---------|
| E/2  | N2       | a        | NaN      | 20        | NaN        | 23        | 1       | NaN        | NaN     |
| E/2  | N2       | a        | NaN      | NaN       | 30         | NaN       | NaN     | 33         | 1       |
| E/2  | N2       | b        | NaN      | 20        | NaN        | 23        | 1       | NaN        | NaN     |
| E/2  | N2       | NaN      | xxx      | NaN       | 30         | NaN       | NaN     | 33         | 1       |

I want to remove duplicate rows and merge them together, grouping by ["type", "subtype1", "subtype2", "subtype3"] and filling the NaNs where suitable. The complementary column pairs are:

  • "flex_best" + "flex_worst"
  • "lead_best" + "lead_worst"
  • "is_best" + "is_worst"

This should result in this dataframe:

| type | subtype1 | subtype2 | subtype3 | flex_best | flex_worst | lead_best | is_best | lead_worst | is_worst|
|------|----------|----------|----------|-----------|------------|-----------|---------|------------|---------|
| E/2  | N2       | a        | NaN      | 20        | 30         | 23        | 1       | 33         | 1       |
| E/2  | N2       | b        | NaN      | 20        | NaN        | 23        | 1       | NaN        | NaN     |
| E/2  | N2       | NaN      | xxx      | NaN       | 30         | NaN       | NaN     | 33         | 1       |

How can I do this with pandas?

– Vega
1 Answer

To do this we can use the justify function provided by @cs95 (credit there given to @Divakar) within a groupby; a sketch of it is reproduced below. Since some of your grouping keys contain NaN, we need to pass dropna=False to the groupby call.

Afterwards, drop the rows that are entirely NaN across the non-grouping columns.
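For reference, here is a sketch of the justify helper, lightly adapted from @Divakar's original: since the groups here mix strings and floats, gp.to_numpy() yields object arrays, so this version uses pd.notna instead of np.isnan and an object-dtype output array.

import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """Push the valid values of a 2D array to one side.

    side is 'left'/'right' for axis=1 and 'up'/'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notna(a)  # works on object arrays mixing strings and floats
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = justified_mask[:, ::-1] if axis == 1 else justified_mask[::-1]
    # object dtype so mixed string/float values survive the round trip
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out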

import numpy as np
import pandas as pd

gp_cols = ['type', 'subtype1', 'subtype2', 'subtype3']
oth_cols = df.columns.difference(gp_cols)

arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False)
                  .apply(lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, 
                                            axis=0, side='up')))

# Reconstruct DataFrame
# Remove entirely NaN rows based on the non-grouping columns
res = (pd.DataFrame(arr, columns=df.columns)
         .dropna(how='all', subset=oth_cols, axis=0))

print(res)

  type subtype1 subtype2 subtype3 flex_best flex_worst lead_best is_best lead_worst is_worst
0  E/2       N2        a      NaN      20.0       30.0      23.0     1.0       33.0      1.0
2  E/2       N2        b      NaN      20.0        NaN      23.0     1.0        NaN      NaN
3  E/2       N2      NaN      xxx       NaN       30.0       NaN     NaN       33.0      1.0
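As a side note on the design choice: if every non-grouping column holds at most one non-null value per group, as in this sample, a plain GroupBy.first() gives the same result more simply, because first() takes the first non-null entry per column. The justify approach is more general, since a group with conflicting extra values keeps them as additional rows instead of silently dropping them. A minimal sketch under that assumption:

# Simpler alternative, assuming at most one non-null value per
# non-grouping column within each group (extras would be dropped):
alt = (df.groupby(gp_cols, sort=False, dropna=False)
         .first()
         .reset_index())
print(alt)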
– ALollz
  • Thank you for your solution. My original data has NaNs in some columns, which seems to lead to far too many removed rows with your code. I added an extended data sample. Is there a way to do the same with NaNs in column "subtype3"? I could fill those with "" or anything else if necessary. – Vega May 18 '21 at 17:29
  • Perfect, thanks a lot!!! I have 79 columns to group by, and your code makes it so easy to apply (I should probably have mentioned that in my problem description). – Vega May 18 '21 at 20:10