Panda, Statsbomb - join rows on pair condition

Question

I'm working with a dataframe structured as this:


  | Sender | recipient | n_pass |  other  |
--| ------ | --------- | ------ | ------- |
0 |  Emma  |    Lisa   |    1   | other_a |
1 |  Lisa  |    Emma   |    1   | other_b | 
2 |  Anna  |    Lisa   |    1   | other_c |
3 |  Lisa  |    Anna   |    1   | other_d |
4 |  Emma  |    Jade   |    1   | other_e |
5 |  Lisa  |    Jade   |    1   | other_f |
6 |  Jade  |    Lisa   |    1   | other_g |

My objective is to to merge the rows for each couple of players, regardless if they are senders or recipients, or no merge if there is no other combination.

Output should look like this:

  | Player_1 | Player_2 | n_pass |  other  |
--| -------- | -------- | ------ | ------- |
0 |   Emma   |    Lisa  |    2   | other_a |
1 |   Anna   |    Lisa  |    2   | other_c |
2 |   Emma   |    Jade  |    1   | other_e |
3 |   Lisa   |    Jade  |    2   | other_f |

Note:

Summing the n_pass and dropping the remaining rows is sufficient as there are other columns that don't need to be merged or changed
I used names in this example for simplicity but the real condition will be done on unique numerical IDs for each player

I'm looking for something more efficient than my current solution using a nested loop.

I changed the duplicate targets out to be more applicable, though your editing mistake did make it so the original target made sense. I think the new ones should work. See as an example [this answer](/a/53662762/15497888). Sort across rows: `df[['Sender', 'recipient']] = np.sort(df[['Sender', 'recipient']], axis=1)` Then group normally: `df.groupby(['Sender', 'recipient']).agg(n_pass=('n_pass', 'sum'), other=('other', 'first')).reset_index()` — Henry Ecker, Dec 08 '22 at 17:22
Okay it works, but I can't add the "other". In detail I have ("duoId","name1","name2"). Error is "Must provide 'func' or tuples of '(column, aggfunc)." — Vitasalato, Dec 08 '22 at 18:23
This is why it's helpful to have a realistic example when you post questions. You can't just add additional column names into the "other" `agg` without passing an aggfunc for every column. The error message outlines this, you have to pass a tuple for every column: `...agg(..., col1=('col1', 'first'), col2=('col2', 'first'))` (or whatever) — Henry Ecker, Dec 08 '22 at 18:26

score 0 · Answer 1 · answered Dec 08 '22 at 09:26

Example

data = [['Emma', 'Lisa', 1, 'other_a'],
        ['Lisa', 'Emma', 1, 'other_b'],
        ['Anna', 'Lisa', 1, 'other_c'],
        ['Lisa', 'Anna', 1, 'other_d'],
        ['Emma', 'Jade', 1, 'other_e'],
        ['Lisa', 'Jade', 1, 'other_f'],
        ['Jade', 'Lisa', 1, 'other_g']]
col = ['Sender', 'recipient', 'n_pass', 'other']
df = pd.DataFrame(data, columns=col)

Code

make dataframe has sorted Sender and recipient

df1 = df[['Sender', 'recipient']].apply(lambda x: pd.Series(x.sort_values().values), axis=1)

df1

    0       1
0   Emma    Lisa
1   Emma    Lisa
2   Anna    Lisa
3   Anna    Lisa
4   Emma    Jade
5   Jade    Lisa
6   Jade    Lisa

make condition by df1 and boolean indexing

df[~df1.duplicated()].reset_index(drop=True)

result:

    Sender  recipient   n_pass  other
0   Emma    Lisa        1   other_a
1   Anna    Lisa        1   other_c
2   Emma    Jade        1   other_e
3   Lisa    Jade        1   other_f

Thanks for the answer, however I made a mistake in an edit and the desired output was wrong. n_pass must be summed before removing the duplicates — Vitasalato, Dec 08 '22 at 10:18

Panda, Statsbomb - join rows on pair condition

1 Answers1