-1

I'm experienced with Pandas but stumbled upon a problem that I can't seem to figure out.

I have a large dataset ((40,000, 16)) and I am trying to group it by a specific column ("group_name" for this matter) and then for each group apply the following rules so it'd merge into one row per group:

  • x1, x2, x3 are the "important" columns, if one row has less nulls than the others, take it. (see example with row D)
  • If there are conflicts in any column, it's arbitrary and we can pick whatever.
  • Combine the nulls on the important fields (x1, x2, x3), see example with row A.

Here is an example with 6 rows that should turn into 4 groups (aka 4 rows).

Input, result and some explanations.

So far I have groups = df.groupby['group_name']

I tried many other solutions such as summing each group, applying a transformation, aggregating by each 'important' column, merging on each 'important' column and more. Each solution brought it's own problems so I'm offering this question here without limiting people to a certain way.

Also, I spent nearly two days combining different solutions from other questions but none has seem to work. Perhaps I've missed something.

  • Please note that since this is a large dataset, I'd very much like to avoid using for loop on each group since efficiency is something to consider here.

I hope I explained everything properly, please let me know if something is unclear.

Code to re-create the dataframe (thanks to @Henry Ecker from the first answer):

df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', None, None, 'abc', 'xx', None],
    'x2': [None, 'b', None, 'def', 'yy', None],
    'x3': [None, None, None, None, 'zz', 'ff']
})
OmerM25
  • 243
  • 2
  • 13
  • Does this answer your question? [pandas-merging-101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Anurag Dabas Jun 13 '21 at 16:17
  • 2
    Please provide the expected [MRE - Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Show where the intermediate results deviate from the ones you expect. We should be able to paste a single block of your code into file, run it, and reproduce your problem. This also lets us test any suggestions in your context. [Include your minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of the example. – Prune Jun 13 '21 at 16:36
  • 1
    Off-site links and images of text are not acceptable; your posting must be self-contained, in keeping with the purpose of this site. – Prune Jun 13 '21 at 16:37
  • @Prune sorry I wasn't aware that a picture is not enough, Henry Ecker already did that in his answer so I added that in the question, apologies. – OmerM25 Jun 14 '21 at 09:20
  • @AnuragDabas I tried mixing several solutions there so it'd fit this specific problem but I couldn't find anything. – OmerM25 Jun 14 '21 at 09:21

1 Answers1

2

Try with groupby aggregate 'first' to get the first (valid) value from every column for each group_name:

new_df = df.groupby('group_name', as_index=False).agg('first')

new_df:

  group_name       z1        z2        z3      zN    x1    x2    x3
0          A   value1    value2    value3  valueN     a     b  None
1          B   value1    value2    value3  valueN  None  None  None
2          C   value1    value2    value3  valueN   abc   def  None
3          D  value99  value100  value101  valueN    xx    yy    zz

*Note if those are string 'null' mask them out first + fillna to put them back:

new_df = (
    df.mask(df.eq('null'))
        .groupby('group_name', as_index=False).agg('first')
        .fillna('null')
)

new_df:

  group_name       z1        z2        z3      zN    x1    x2    x3
0          A   value1    value2    value3  valueN     a     b  null
1          B   value1    value2    value3  valueN  null  null  null
2          C   value1    value2    value3  valueN   abc   def  null
3          D  value99  value100  value101  valueN    xx    yy    zz

DataFrame used:

  group_name               z1         z2        z3         zN    x1    x2    x3
0          A           value1     value2    value3     valueN     a  None  None
1          A  different_value     value2    value3     valueN  None     b  None
2          B           value1     value2    value3     valueN  None  None  None
3          C           value1     value2    value3     valueN   abc   def  None
4          D          value99   value100  value101     valueN    xx    yy    zz
5          D         value999  value1000  value101  valueN200  None  None    ff
df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', None, None, 'abc', 'xx', None],
    'x2': [None, 'b', None, 'def', 'yy', None],
    'x3': [None, None, None, None, 'zz', 'ff']
})

DataFrame with string 'null' used:

  group_name               z1         z2        z3         zN    x1    x2    x3
0          A           value1     value2    value3     valueN     a  null  null
1          A  different_value     value2    value3     valueN  null     b  null
2          B           value1     value2    value3     valueN  null  null  null
3          C           value1     value2    value3     valueN   abc   def  null
4          D          value99   value100  value101     valueN    xx    yy    zz
5          D         value999  value1000  value101  valueN200  null  null    ff
df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', 'null', 'null', 'abc', 'xx', 'null'],
    'x2': ['null', 'b', 'null', 'def', 'yy', 'null'],
    'x3': ['null', 'null', 'null', 'null', 'zz', 'ff']
})
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • 1
    That did the trick! I wasn't aware of the `first` aggregation function. Thank you kindly for providing different options for solutions and the code to re-create the dataframe. – OmerM25 Jun 14 '21 at 09:22