Merge each group's rows into one row

Question

I'm experienced with Pandas but stumbled upon a problem that I can't seem to figure out.

I have a large dataset ((40,000, 16)) and I am trying to group it by a specific column ("group_name" for this matter) and then for each group apply the following rules so it'd merge into one row per group:

x1, x2, x3 are the "important" columns, if one row has less nulls than the others, take it. (see example with row D)
If there are conflicts in any column, it's arbitrary and we can pick whatever.
Combine the nulls on the important fields (x1, x2, x3), see example with row A.

Here is an example with 6 rows that should turn into 4 groups (aka 4 rows).

So far I have groups = df.groupby['group_name']

I tried many other solutions such as summing each group, applying a transformation, aggregating by each 'important' column, merging on each 'important' column and more. Each solution brought it's own problems so I'm offering this question here without limiting people to a certain way.

Also, I spent nearly two days combining different solutions from other questions but none has seem to work. Perhaps I've missed something.

Please note that since this is a large dataset, I'd very much like to avoid using for loop on each group since efficiency is something to consider here.

I hope I explained everything properly, please let me know if something is unclear.

Code to re-create the dataframe (thanks to @Henry Ecker from the first answer):

df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', None, None, 'abc', 'xx', None],
    'x2': [None, 'b', None, 'def', 'yy', None],
    'x3': [None, None, None, None, 'zz', 'ff']
})

Does this answer your question? [pandas-merging-101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — Anurag Dabas, Jun 13 '21 at 16:17
Please provide the expected [MRE - Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Show where the intermediate results deviate from the ones you expect. We should be able to paste a single block of your code into file, run it, and reproduce your problem. This also lets us test any suggestions in your context. [Include your minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of the example. — Prune, Jun 13 '21 at 16:36
Off-site links and images of text are not acceptable; your posting must be self-contained, in keeping with the purpose of this site. — Prune, Jun 13 '21 at 16:37
@Prune sorry I wasn't aware that a picture is not enough, Henry Ecker already did that in his answer so I added that in the question, apologies. — OmerM25, Jun 14 '21 at 09:20
@AnuragDabas I tried mixing several solutions there so it'd fit this specific problem but I couldn't find anything. — OmerM25, Jun 14 '21 at 09:21

Henry Ecker · Accepted Answer · 2021-06-13T16:35:00.893

Try with groupby aggregate 'first' to get the first (valid) value from every column for each group_name:

new_df = df.groupby('group_name', as_index=False).agg('first')

new_df:

  group_name       z1        z2        z3      zN    x1    x2    x3
0          A   value1    value2    value3  valueN     a     b  None
1          B   value1    value2    value3  valueN  None  None  None
2          C   value1    value2    value3  valueN   abc   def  None
3          D  value99  value100  value101  valueN    xx    yy    zz

*Note if those are string 'null' mask them out first + fillna to put them back:

new_df = (
    df.mask(df.eq('null'))
        .groupby('group_name', as_index=False).agg('first')
        .fillna('null')
)

new_df:

  group_name       z1        z2        z3      zN    x1    x2    x3
0          A   value1    value2    value3  valueN     a     b  null
1          B   value1    value2    value3  valueN  null  null  null
2          C   value1    value2    value3  valueN   abc   def  null
3          D  value99  value100  value101  valueN    xx    yy    zz

DataFrame used:

  group_name               z1         z2        z3         zN    x1    x2    x3
0          A           value1     value2    value3     valueN     a  None  None
1          A  different_value     value2    value3     valueN  None     b  None
2          B           value1     value2    value3     valueN  None  None  None
3          C           value1     value2    value3     valueN   abc   def  None
4          D          value99   value100  value101     valueN    xx    yy    zz
5          D         value999  value1000  value101  valueN200  None  None    ff

df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', None, None, 'abc', 'xx', None],
    'x2': [None, 'b', None, 'def', 'yy', None],
    'x3': [None, None, None, None, 'zz', 'ff']
})

DataFrame with string 'null' used:

  group_name               z1         z2        z3         zN    x1    x2    x3
0          A           value1     value2    value3     valueN     a  null  null
1          A  different_value     value2    value3     valueN  null     b  null
2          B           value1     value2    value3     valueN  null  null  null
3          C           value1     value2    value3     valueN   abc   def  null
4          D          value99   value100  value101     valueN    xx    yy    zz
5          D         value999  value1000  value101  valueN200  null  null    ff

df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', 'null', 'null', 'abc', 'xx', 'null'],
    'x2': ['null', 'b', 'null', 'def', 'yy', 'null'],
    'x3': ['null', 'null', 'null', 'null', 'zz', 'ff']
})

That did the trick! I wasn't aware of the `first` aggregation function. Thank you kindly for providing different options for solutions and the code to re-create the dataframe. — OmerM25, Jun 14 '21 at 09:22

Merge each group's rows into one row

1 Answers1

Related