How to get the most pairs out of my pandas dataframe?

Question

I have a dataframe that shows for every row a human pair (name_1 and name_2) together with the corresponding score. The score is a numeric value and represents how well these two people fit together. The higher the score, the better the match between person 1 (name_1) and person 2 (name_2).

As you can see, some names can be found twice or more. Of course, one person can only be matched once. My goal is, to find as many pairs in the dataframe as possible and write each of them into a second dataframe.

The problem that makes me struggle is this:

I think I can get max. 8 pairs out of the dataframe since I have 8 different names in the first column. Unfortunately, the scores for best matches are not clearly separated. One person can match with multiple other persons whereas other persons only can match to one specific person. I am not much interested in the matching-score. I am interested in not loosing any person due to a bad choice of pair combination.

I a looking for a way to find and extract as many pairs of of the dataframe.

This is the dataframe df:

      name_1     name_2  score
27      allen      jolly    1.8
23       anna       rock    2.8
22       anna  christina    1.1
26  christina       rock    2.3
24  christina      allen    1.4
25  christina      jolly    1.4
18      emily       rock    3.7
15      emily  sabastein    3.3
16      emily       anna    2.5
17      emily  christina    2.4
4       jacob      jolly    3.4
1       jacob       rick    2.9
3       jacob      allen    2.4
0       jacob       mary    2.3
2       jacob  christina    2.0
7        mary      jolly    1.7
5        mary       rick    1.4
6        mary  christina    1.3
14       rick       rock    2.8
9        rick  sabastein    2.8
8        rick      emily    2.5
13       rick      jolly    2.3
11       rick  christina    2.1
10       rick       anna    2.0
12       rick      allen    1.5
21  sabastein       rock    3.6
19  sabastein       anna    2.8
20  sabastein  christina    1.9

I think the best matching in terms of total maximum score is:

emely       rock        3.7
jacob       jolly       3.4
sabastein   anna        2.8
rick        allen       1.5
mary        christina   1.3

I am not absolutely sure if this is also the maximum number of pairs I can get. If you know how to get best pairs (see above) or maximum number of pairs, I would be really happy to see.

Afaik what you search for is not included in the set of algorithms of pandas - it's the maximum matching of a graph type problem. I think networkx should be capable to solve this. — SpghttCd, Jan 23 '19 at 16:46
Thanks a lot. Looking at the website documentation https://networkx.github.io/documentation/latest/tutorial.html I find it kind of hard to apply networkx to my specific problem. Any idea how to do it? Is there no simple approach to avoid complex bipartite graph theory? — PParker, Jan 23 '19 at 16:53
1. Let me state I'm not a graph theory expert, but I got in touch with bipartite graphs some years ago. And I never used networkx yet. 2. Bipartite graphs and finding a maximum matching is not a complex thing - you can program it yourself without any library within a few dozens of lines of code. Using networkx should reduce this however to perhaps guessed one dozen. 3. Your problem appears _not_ to be _bipartite_ as you don't have two distinct groups of participants. — SpghttCd, Jan 23 '19 at 18:07
@cosmic_inquiry I will add the desired output for the example in my question! @ SpghttCd I do not really need the optimum solution. I just want to get as many happy pairs out of the df as possible without considering the score too much. This is already a bit tricky in the example dataframe. The actual df is much longer and therefore more difficult to see the solution. I was hopping to get a "fuzzy" step by step approach how to tackle the problem. — PParker, Jan 23 '19 at 19:44

SpghttCd · Accepted Answer · 2019-01-27T22:55:50.330

1

EDIT
In the meantime I found a very convenient function to create a graph from a dataframe, but you should rename your column score to weight for this:
The you could simply write:

G = nx.from_pandas_edgelist(df, 'name_1', 'name_2', 'weight')
mate = nx.max_weight_matching(G)

and that's it.
(Rest is still part of our discussion below, how you process the result further on...)

My approach would be

import pandas as pd
import networkx as nx

df['edges'] = df.apply(lambda r: (r.name_1, r.name_2, {'weight': r.score}), axis=1)

G = nx.Graph()

allnames = set(df.loc[:, ['name_1', 'name_2']].values.flatten())

for s in allnames:
    G.add_node(s)
G.add_edges_from(df.edges)

mate = nx.max_weight_matching(G)

Result:

res = pd.DataFrame(list(mate), columns=['name_1', 'name_2'])
res['score'] = res.apply(lambda r: G[r[0]][r[1]]['weight'], axis=1)

print(res)
print(f'\nMatchings: {len(res)}\nTotal Score: {res.score.sum():.1f}')            

#      name_1     name_2  score
#0       rock      emily    3.7                            
#1       rick  christina    2.1                          
#2       mary      jacob    2.3                            
#3  sabastein       anna    2.8                           
#4      jolly      allen    1.8                                             
#Matchings: 5                                                
#Total Score: 12.7

DocSources:
For setting up the Graph you already had the correct link.
For the maximum_matching function see here https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.matching.max_weight_matching.html#networkx.algorithms.matching.max_weight_matching

edited Jan 27 '19 at 22:55

answered Jan 23 '19 at 22:49

SpghttCd

10,510
2
20
25

Thank you very much for your approach. It looks very interesting. Everything is running as expected, except one last error message at the end of the code. I think in this line 'print(f"{k}, {v}, {G[k][v]['weight']:.1f}")' is going sth. wrong. The error message I get is: ValueError: too many values to unpack (expected 2). I found this link https://stackoverflow.com/questions/1479776/too-many-values-to-unpack-exception but I did not know how to apply it to my problem – PParker Jan 26 '19 at 08:46
1

Ok - this is only the way the result is printed. Test perhaps `print(k, v, G[k][v]['weight'])`. Which python version are you working with? – SpghttCd Jan 26 '19 at 12:18
1

or better: `print("{}, {}, {:.1f}".format(k, v, G[k][v]['weight']))` - f-strings were introduced in python 3.6 – SpghttCd Jan 26 '19 at 15:37
I am using Python 3.6. Unfortunately I get the same error message as before. It must have sth. to do with the print command. The G-parameter works fine if I put values in it. At the end I would prefer a dataframe. Is it possible to return a dataframe instead of a print – PParker Jan 26 '19 at 16:30
@ SpghttCd The shape of the list in the first line doesn't match with the dataframe. Error message: Shape of passed values is (1, 10), indices imply (2, 10). When I remove the second name "name_2", it works, but because of that the index of the second line is incorrect. So I guess the problem is to make a correct dataframe with two columns "name_1" and "name_2". Afterwards I hope the following code will work properly. – PParker Jan 27 '19 at 10:53
1

Strange, no error with the above code here. `type(mate)` is `set` here, although doc says `dict`. So I intended to initialize the dataframe with a list of 2-tuples, which leads to two columns, so the columns parameter would match. Do you apply the above code to your sample data from the question or to some other different data? Because here `len(mate)` is only 5, not 10 as stated... – SpghttCd Jan 27 '19 at 11:30
ok, this is interesting. I am using the identical df as shown in my very first post. `type(made)` is 'dict' and 'made' looks like this `{'allen': 'jolly', 'anna': 'sabastein', 'christina': 'rick', 'emily': 'rock', 'jacob': 'mary', 'jolly': 'allen', 'mary': 'jacob', 'rick': 'christina', 'rock': 'emily', 'sabastein': 'anna'}` . The length of made is `len(made) --> 10` . I do not understand why it is different to yours! I could set up a new environment and double check the result but actually it should work – PParker Jan 27 '19 at 15:51
1

Then I think we are using different versions of `networkx` - and as your datatype of the maximum matching result corresponds to the documentation, I think mine is too old. However, the creation of a dataframe out of my `set` was quite easy, while I find it a little odd that the dict is twice as long as the number of matchings, only to represent both directions of matchings... – SpghttCd Jan 27 '19 at 17:56
1

PS: my version is 2.2 which is the latest stable, so...?? – SpghttCd Jan 27 '19 at 17:59
1

You could try sth like `res = pd.DataFrame([mate.keys(), mate.values()]).T` – SpghttCd Jan 27 '19 at 18:30
1

Just checked another installation: nx version 2.1 also returns a `set` with `max_weight_matching` – SpghttCd Jan 27 '19 at 18:58
Thank you very much for your help!!! I use `res = pd.DataFrame([mate.keys(), mate.values()]).T` which gives me the desired dataframe. For some reason I get with the df with 10 matches and not 5 as in your case. The more matches the better for me. Maybe I am using a different version than you do. – PParker Jan 28 '19 at 17:44
1

Nope, misunderstanding: you get the same number of matches as me (most probably even the very same matches). The only difference is, that in my case, the 5 matches are 5 `tuple`s in a `set`, while in your resulting `dict` these 5 matches appear in the first half of the `dict` like `name_1` as `key` and `name_2` as `value` and in the second half vice versa - they are listed twice. Therefore 10 `key`/`value` pairs, but only 5 matches. - Just for interests sake: could you please post the result of your `nx.__version__`? – SpghttCd Jan 28 '19 at 21:41
1

(and just to state clearly: with the 10 different names in your datasaet above, you'll obviously never get more than 5 matches, this is the maximum anyway) – SpghttCd Jan 28 '19 at 21:50

How to get the most pairs out of my pandas dataframe?

1 Answers1

Linked