
I have a dataframe df:

import numpy as np
import pandas as pd

df = pd.DataFrame(
{
    "type": ["E/2", "E/2", "E/2", "E/2"],
    "subtype1": ["N2", "N2", "N2", "N2"],
    "subtype2": ["a", "a", "b", np.nan],
    "subtype3": [np.nan, np.nan, np.nan, "xxx"],
    "flex_best": [20, np.nan, 20, np.nan],
    "flex_worst": [np.nan, 30, np.nan, 30],
    "lead_best": [23, np.nan, 23, np.nan],
    "is_best": [1, np.nan, 1, np.nan],
    "lead_worst": [np.nan, 33, np.nan, 33],
    "is_worst": [np.nan, 1, np.nan, 1],
}
)
df.head()

| type | subtype1 | subtype2 | subtype3 | flex_best | flex_worst | lead_best | is_best | lead_worst | is_worst|
|------|----------|----------|----------|-----------|------------|-----------|---------|------------|---------|
| E/2  | N2       | a        | NaN      | 20        | NaN        | 23        | 1       | NaN        | NaN     |
| E/2  | N2       | a        | NaN      | NaN       | 30         | NaN       | NaN     | 33         | 1       |
| E/2  | N2       | b        | NaN      | 20        | NaN        | 23        | 1       | NaN        | NaN     |
| E/2  | N2       | NaN      | xxx      | NaN       | 30         | NaN       | NaN     | 33         | 1       |

I want to remove duplicate rows and merge them together, grouping by ["type", "subtype1", "subtype2", "subtype3"] and filling the NaNs where suitable. The complementary column pairs are:

  • "flex_best" + "flex_worst"
  • "lead_best" + "lead_worst"
  • "is_best" + "is_worst"

This should result in this dataframe:

| type | subtype1 | subtype2 | subtype3 | flex_best | flex_worst | lead_best | is_best | lead_worst | is_worst|
|------|----------|----------|----------|-----------|------------|-----------|---------|------------|---------|
| E/2  | N2       | a        | NaN      | 20        | 30         | 23        | 1       | 33         | 1       |
| E/2  | N2       | b        | NaN      | 20        | NaN        | 23        | 1       | NaN        | NaN     |
| E/2  | N2       | NaN      | xxx      | NaN       | 30         | NaN       | NaN     | 33         | 1       |

How can I do this with pandas?

– Vega
1 Answer

To do this we can use the justify function provided by @cs95 (credit there given to @Divakar) within a groupby; a sketch of it is reproduced below. Since some of your grouping keys contain NaN, we need to pass dropna=False to the groupby call.

Afterwards, drop the rows that are entirely NaN across the non-grouping columns.
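For reference, here is a sketch of the justify helper, lightly adapted from @Divakar's original: since the groups here mix strings and floats, gp.to_numpy() yields object arrays, so this version uses pd.notna instead of np.isnan and an object-dtype output array.

import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """Push the valid values of a 2D array to one side.

    side is 'left'/'right' for axis=1 and 'up'/'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notna(a)  # works on object arrays mixing strings and floats
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = justified_mask[:, ::-1] if axis == 1 else justified_mask[::-1]
    # object dtype so mixed string/float values survive the round trip
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out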

import numpy as np
import pandas as pd

gp_cols = ['type', 'subtype1', 'subtype2', 'subtype3']
oth_cols = df.columns.difference(gp_cols)

arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False)
                  .apply(lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, 
                                            axis=0, side='up')))

# Reconstruct DataFrame
# Remove entirely NaN rows based on the non-grouping columns
res = (pd.DataFrame(arr, columns=df.columns)
         .dropna(how='all', subset=oth_cols, axis=0))

print(res)

  type subtype1 subtype2 subtype3 flex_best flex_worst lead_best is_best lead_worst is_worst
0  E/2       N2        a      NaN      20.0       30.0      23.0     1.0       33.0      1.0
2  E/2       N2        b      NaN      20.0        NaN      23.0     1.0        NaN      NaN
3  E/2       N2      NaN      xxx       NaN       30.0       NaN     NaN       33.0      1.0
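As a side note on the design choice: if every non-grouping column holds at most one non-null value per group, as in this sample, a plain GroupBy.first() gives the same result more simply, because first() takes the first non-null entry per column. The justify approach is more general, since a group with conflicting extra values keeps them as additional rows instead of silently dropping them. A minimal sketch under that assumption:

# Simpler alternative, assuming at most one non-null value per
# non-grouping column within each group (extras would be dropped):
alt = (df.groupby(gp_cols, sort=False, dropna=False)
         .first()
         .reset_index())
print(alt)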
– ALollz
  • Thank you for your solution. My original data has NaNs in some columns, which seems to lead to far too many removed rows with your code. I added an extended data sample. Is there a way to do the same with NaNs in column "subtype3"? I could fill those with "" or anything else if necessary. – Vega May 18 '21 at 17:29
  • Perfect, thanks a lot!!! I have 79 columns to group by, and your code makes it so easy to apply (I should probably have mentioned that in my problem description). – Vega May 18 '21 at 20:10