I'm experienced with Pandas but stumbled upon a problem that I can't seem to figure out.
I have a large dataset (shape (40,000, 16)) and I am trying to group it by a specific column ("group_name" in this case) and then, for each group, apply the following rules so it merges into one row per group:
- x1, x2, x3 are the "important" columns; if one row has fewer nulls than the others, take it. (see example with row D)
- If there are conflicts in any column, it's arbitrary and we can pick whatever.
- Combine the nulls on the important fields (x1, x2, x3), see example with row A.
Here is an example with 6 rows that should turn into 4 groups (i.e., 4 rows).
So far I have
groups = df.groupby('group_name')
I tried many other solutions, such as summing each group, applying a transformation, aggregating by each 'important' column, merging on each 'important' column, and more. Each solution brought its own problems, so I'm offering this question here without limiting people to a certain approach.
Also, I spent nearly two days combining different solutions from other questions, but none seems to have worked. Perhaps I've missed something.
- Please note that since this is a large dataset, I'd very much like to avoid using a `for` loop over each group, since efficiency is something to consider here.
I hope I explained everything properly, please let me know if something is unclear.
Code to re-create the dataframe (thanks to @Henry Ecker from the first answer):
import pandas as pd

df = pd.DataFrame({
'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
'z1': ['value1', 'different_value', 'value1',
'value1', 'value99', 'value999'],
'z2': ['value2'] * 4 + ['value100', 'value1000'],
'z3': ['value3'] * 4 + ['value101', 'value101'],
'zN': ['valueN'] * 5 + ['valueN200'],
'x1': ['a', None, None, 'abc', 'xx', None],
'x2': [None, 'b', None, 'def', 'yy', None],
'x3': [None, None, None, None, 'zz', 'ff']
})
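For reference, one vectorized approach that appears to satisfy the rules above on this example is `GroupBy.first()`, which takes the first non-null value per column within each group, so it combines nulls across rows and resolves conflicts arbitrarily (by taking the first value). A minimal sketch using only the important columns:

```python
import pandas as pd

df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'x1': ['a', None, None, 'abc', 'xx', None],
    'x2': [None, 'b', None, 'def', 'yy', None],
    'x3': [None, None, None, None, 'zz', 'ff'],
})

# GroupBy.first() skips nulls per column, so each group's row is the
# column-wise combination of its rows' non-null values; when values
# conflict, the first one encountered wins (arbitrary is acceptable here).
merged = df.groupby('group_name', as_index=False).first()
```

For this example, `merged` has 4 rows; group A ends up with x1='a', x2='b', x3=None, and group D keeps the less-null row's values. This is fully vectorized, so no Python-level loop over groups is needed.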