6

I have a dataframe with different people. Each row contains attributes which characterize the individual person. Basically I need something like a filter or matching algorithm which weights specific attributes. The dataframe looks like this:

df = pd.DataFrame({
    'sex': ['m', 'f', 'm', 'f', 'm', 'f'],
    'food': [0, 0, 1, 3, 4, 3],
    'age': ['young', 'young', 'young', 'old', 'young', 'young'],
    'kitchen': [0, 1, 2, 0, 1, 2],
})

The dataframe df looks like this:

    sex food  age     kitchen
0   m    0    young    0
1   f    0    young    1
2   m    1    young    2
3   f    3    old      0
4   m    4    young    1
5   f    3    young    2

I am looking for an algorithm which groups all people in the dataframe into pairs. My plan is to find pairs of two people based on the following attributes:

  1. One person must have a kitchen (kitchen=1)
    It is important that at least one person has a kitchen.

    kitchen=0 --> person has no kitchen

    kitchen=1 --> person has a kitchen

    kitchen=2 --> person has a kitchen but only in emergency (when there is no other option)

  2. Same food preferences

    food=0 --> meat eater

    food=1 --> does not matter

    food=2 --> vegan

    food=3 --> vegetarian

    A meat eater (food=0) can be matched with a person who doesn't care about food preferences (food=1) but can't be matched with a vegan or vegetarian. A vegan (food=2) fits best with a vegetarian (food=3) and, if necessary, can go with food=1. And so on...

  3. Similar age

    There are nine age groups: 10-18; 18-22; 22-26; 26-29, 29-34; 34-40; 40-45; 45-55 and 55-75. People in the same age group match perfectly. The young age groups with the older age groups do not match very well. Similar age groups match a little bit better. There is no clearly defined condition. The meaning of "old" and "young" is relative.

The sex doesn't matter. There are many pair combinations possible. Because my actual dataframe is very long (3000 rows), I need to find an automated solution. A solution that gives me the best pairs in a dataframe or dictionary or something else.

I really do not know how to approach this problem. I looked for similar problems on Stack Overflow but did not find anything suitable; most of it was too theoretical, and nothing really fit my problem.

My expected output here would be, for example a dictionary (not sure how) or a dataframe which is sorted in a way that every two rows can be seen as one pair.

Background: The goal is to make pairs for free-time activities. I assume that people in the same or similar age groups share the same interests, so I want to take this into account in my code.

Georgy
PParker
  • @PParker Referring to your statement "If possible, the pairs are in the same age group. If this is not possible, then maybe in a similar age group." Do you have numeric age in age column or only two string values: "Young" and "Old"? – Anidhya Bhatnagar Jan 01 '19 at 16:05
  • @PParker You might want to take a look at the following resources: [Maximize pairings subject to distance constraint](https://cs.stackexchange.com/questions/76445/maximize-pairings-subject-to-distance-constraint), [Blossom algorithm](https://en.wikipedia.org/wiki/Blossom_algorithm), [Matching with constraints](https://stackoverflow.com/questions/20205154/matching-with-constraints), [Assignment problem](https://en.wikipedia.org/wiki/Assignment_problem), [Hungarian algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm). – a_guest Jan 01 '19 at 19:23
  • @AnidhyaBhatnagar In my simplified example I only have two age groups (young and old). However, in the real dataframe I have numeric age. My plan is to make multiple age-groups (for example "very young", "young", "old", "very old",...). – PParker Jan 02 '19 at 15:17

3 Answers

12

I added a 'name' column as a key to identify each person.

Approach

The approach is to score each attribute and then use the scores to filter and rank the candidate pairs according to the given conditions.

Scoring for Kitchen

For kitchen scores we used:

  • Person has no kitchen : 0
  • Person has a kitchen : 1
  • Person has kitchen but only in emergency : 0.5

if Condition Logic for Kitchen

We check whether [kitchen score of record 1] + [kitchen score of record 2] is greater than zero. The possible cases are:

  1. Neither member has a kitchen (sum is 0) [EXCLUDED by the > 0 condition]
  2. Both members have a kitchen (sum is 2)
  3. One member has a kitchen and the other has none (sum is 1)
  4. Both have an emergency kitchen (sum is 1)
  5. One has an emergency kitchen and the other has a kitchen (sum is 1.5)
  6. One member has an emergency kitchen and the other has none (sum is 0.5)
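
The six cases can be checked with a small sketch. The names `KITCHEN_SCORE` and `kitchen_ok` are illustrative only and do not appear in the solution code further down.

```python
from itertools import combinations_with_replacement

# Kitchen codes from the question mapped to the scores listed above
KITCHEN_SCORE = {0: 0.0, 1: 1.0, 2: 0.5}

def kitchen_ok(k1, k2):
    """Return True when the pair has at least some access to a kitchen."""
    return KITCHEN_SCORE[k1] + KITCHEN_SCORE[k2] > 0

# Enumerate every unordered combination of kitchen codes
for k1, k2 in combinations_with_replacement(KITCHEN_SCORE, 2):
    total = KITCHEN_SCORE[k1] + KITCHEN_SCORE[k2]
    print(k1, k2, total, kitchen_ok(k1, k2))
```

Only the combination (0, 0) fails the check; every other combination stays in the candidate list, with a higher sum when a regular kitchen is available.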

Scoring for Food

For food scores we used:

  • food = 0 --> meat eater : -1
  • food = 1 --> does not matter : 0
  • food = 2 --> vegan : 1
  • food = 3 --> vegetarian : 1

if Condition Logic for Food

We check whether [food score of record 1] x [food score of record 2] is greater than or equal to zero. The possible cases are:

  1. Both members are meat eaters : -1 x -1 = 1 [INCLUDED]
  2. One member is a meat eater and the other is vegan or vegetarian : -1 x 1 = -1 [EXCLUDED]
  3. One member is a meat eater and the other does not mind : -1 x 0 = 0 [INCLUDED]
  4. One member is vegan or vegetarian and the other does not mind : 1 x 0 = 0 [INCLUDED]
  5. Both members are either vegan or vegetarian : 1 x 1 = 1 [INCLUDED]
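
As a minimal sketch of the sign trick described above (the names `FOOD_SCORE` and `food_ok` are made up for illustration), the product is negative only when a meat eater meets a vegan or vegetarian:

```python
# Food codes from the question mapped to the scores listed above
FOOD_SCORE = {0: -1, 1: 0, 2: 1, 3: 1}

def food_ok(f1, f2):
    """A pair is compatible when the product of the two scores is not negative."""
    return FOOD_SCORE[f1] * FOOD_SCORE[f2] >= 0

print(food_ok(0, 1))  # meat eater with "does not matter"
print(food_ok(0, 3))  # meat eater with vegetarian
print(food_ok(2, 3))  # vegan with vegetarian
```

Note that under this scheme a vegan/vegetarian pair scores the same as a vegan/vegan pair, and any pair involving "does not matter" contributes 0 to the final score.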

Scoring for Age Groups

For scoring age groups, we assigned some values to the groups as:

  • 10-18 : 1
  • 18-22 : 2
  • 22-26 : 3
  • 26-29 : 4
  • 29-34 : 5
  • 34-40 : 6
  • 40-45 : 7
  • 45-55 : 8
  • 55-75 : 9

Age Score Calculation

For calculating the age score, the following formula is used: age_score = round(1 - abs(age group value of person 1 - age group value of person 2) / 10, 2)

The calculation works as follows:

  1. First, take the absolute value of the difference between the age group values of the two persons.
  2. Then divide it by 10 to normalize it.
  3. Finally, subtract this value from 1 to invert the distance, so that persons in the same or nearby age groups get a higher value and persons in distant age groups get a lower value.

Example cases:

  1. 18-22 and 18-22 : round(1 - (abs(2 - 2) / 10), 2) = 1.0
  2. 45-55 and 45-55 : round(1 - (abs(8 - 8) / 10), 2) = 1.0
  3. 18-22 and 45-55 : round(1 - (abs(2 - 8) / 10), 2) = 0.4
  4. 10-18 and 55-75 : round(1 - (abs(1 - 9) / 10), 2) = 0.2
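
The formula can be written as a small standalone function. This is only a sketch; `AGE_GROUPS`, `AGE_VALUE` and `age_score` are illustrative names, and the group values follow the table above.

```python
# Age groups in ascending order; the value of a group is its 1-based position
AGE_GROUPS = ['10-18', '18-22', '22-26', '26-29', '29-34',
              '34-40', '40-45', '45-55', '55-75']
AGE_VALUE = {group: i + 1 for i, group in enumerate(AGE_GROUPS)}

def age_score(group1, group2):
    """Score from 1.0 (same group) down to 0.2 (youngest with oldest)."""
    distance = abs(AGE_VALUE[group1] - AGE_VALUE[group2])
    return round(1 - distance / 10, 2)

print(age_score('18-22', '18-22'))
print(age_score('18-22', '45-55'))
print(age_score('10-18', '55-75'))
```

Because the largest possible distance is 8, the age score never drops below 0.2, so age alone cannot exclude a pair; it only ranks the pairs that passed the food and kitchen conditions.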

Final Score Calculation

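The final score is calculated as:

Final Score = Food Score + Kitchen Score + Age Score

Here the food score is the product of the two persons' food scores, and the kitchen score is the sum of their kitchen scores, as used in the conditions above. The data is then sorted by final score in descending order to obtain the best pairs.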
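
Before the full pandas solution, here is a compact pure-Python sketch of the same scoring pipeline on four of the people used below. The helper `pair_score` and the `people` list are illustrative assumptions, not part of the solution code.

```python
from itertools import combinations

KITCHEN_SCORE = {0: 0.0, 1: 1.0, 2: 0.5}
FOOD_SCORE = {0: -1, 1: 0, 2: 1, 3: 1}
AGE_VALUE = {'10-18': 1, '18-22': 2, '22-26': 3, '26-29': 4, '29-34': 5,
             '34-40': 6, '40-45': 7, '45-55': 8, '55-75': 9}

def pair_score(p, q):
    """Return the final score for two people, or None if the pair is excluded."""
    food = FOOD_SCORE[p['food']] * FOOD_SCORE[q['food']]
    kitchen = KITCHEN_SCORE[p['kitchen']] + KITCHEN_SCORE[q['kitchen']]
    if food < 0 or kitchen <= 0:
        return None
    age = round(1 - abs(AGE_VALUE[p['age']] - AGE_VALUE[q['age']]) / 10, 2)
    return food + kitchen + age

people = [
    {'name': 'mary', 'food': 0, 'age': '22-26', 'kitchen': 1},
    {'name': 'allen', 'food': 0, 'age': '45-55', 'kitchen': 1},
    {'name': 'sabastein', 'food': 2, 'age': '18-22', 'kitchen': 1},
    {'name': 'smith', 'food': 3, 'age': '18-22', 'kitchen': 1},
]

# Score every unordered pair once and keep only the allowed ones
scored = []
for p, q in combinations(people, 2):
    score = pair_score(p, q)
    if score is not None:
        scored.append((p['name'], q['name'], score))
scored.sort(key=lambda row: row[2], reverse=True)
print(scored)
```

With these four people, only two pairs survive the food condition: sabastein/smith with 4.0 and mary/allen with 3.5, which agrees with the top rows of the output table further down.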

Solution Code

import pandas as pd
import numpy as np

# Creating the DataFrame, here I have added the attribute 'name' for identifying the record.
df = pd.DataFrame({
    'name' : ['jacob', 'mary', 'rick', 'emily', 'sabastein', 'anna', 
              'christina', 'allen', 'jolly', 'rock', 'smith', 'waterman', 
              'mimi', 'katie', 'john', 'rose', 'leonardo', 'cinthy', 'jim', 
              'paul'],
    'sex' : ['m', 'f', 'm', 'f', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f', 
             'f', 'm', 'f', 'm', 'f', 'm', 'm'],
    'food' : [0, 0, 1, 3, 2, 3, 1, 0, 0, 3, 3, 2, 1, 2, 1, 0, 1, 0, 3, 1],
    'age' : ['10-18', '22-26', '29-34', '40-45', '18-22', '34-40', '55-75',
             '45-55', '26-29', '26-29', '18-22', '55-75', '22-26', '45-55', 
             '10-18', '22-26', '40-45', '45-55', '10-18', '29-34'],
    'kitchen' : [0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 1, 0, 1, 1, 1, 0, 2, 0, 2, 1],
})

# Adding a normalized field 'k_scr' for kitchen
df['k_scr'] = df['kitchen'].map({0: 0.0, 1: 1.0, 2: 0.5})

# Adding a normalized field 'f_scr' for food
df['f_scr'] = df['food'].map({0: -1, 1: 0, 2: 1, 3: 1})

# Adding a normalized field 'a_scr' for age (group rank from 1 to 9)
age_groups = ['10-18', '18-22', '22-26', '26-29', '29-34',
              '34-40', '40-45', '45-55', '55-75']
df['a_scr'] = df['age'].map({grp: i + 1 for i, grp in enumerate(age_groups)})

# Printing DataFrame after adding normalized score values
print(df)

commonarr = [] # Empty array for our output
dfarr = np.array(df) # Converting DataFrame to Numpy Array
for i in range(len(dfarr) - 1): # Iterating the Array row
    for j in range(i + 1, len(dfarr)): # Iterating the Array row + 1
        # Check for Food Condition to include relevant records
        if dfarr[i][6] * dfarr[j][6] >= 0: 
            # Check for Kitchen Condition to include relevant records
            if dfarr[i][5] + dfarr[j][5] > 0:
                row = []
                # Appending the names
                row.append(dfarr[i][0])
                row.append(dfarr[j][0])
                # Appending the final score
                row.append((dfarr[i][6] * dfarr[j][6]) +
                           (dfarr[i][5] + dfarr[j][5]) +
                           (round((1 - (abs(dfarr[i][7] -
                                            dfarr[j][7]) / 10)), 2)))

                # Appending the row to the Final Array
                commonarr.append(row)

# Converting Array to DataFrame
ndf = pd.DataFrame(commonarr)

# Sorting the DataFrame on Final Score
ndf = ndf.sort_values(by=[2], ascending=False)
print(ndf)

Input / Intermediate DataFrame with Scores

         name sex  food    age  kitchen  k_scr  f_scr a_scr
0       jacob   m     0  10-18        0    0.0     -1     1
1        mary   f     0  22-26        1    1.0     -1     3
2        rick   m     1  29-34        2    0.5      0     5
3       emily   f     3  40-45        0    0.0      1     7
4   sabastein   m     2  18-22        1    1.0      1     2
5        anna   f     3  34-40        2    0.5      1     6
6   christina   f     1  55-75        2    0.5      0     9
7       allen   m     0  45-55        1    1.0     -1     8
8       jolly   f     0  26-29        0    0.0     -1     4
9        rock   m     3  26-29        0    0.0      1     4
10      smith   m     3  18-22        1    1.0      1     2
11   waterman   m     2  55-75        0    0.0      1     9
12       mimi   f     1  22-26        1    1.0      0     3
13      katie   f     2  45-55        1    1.0      1     8
14       john   m     1  10-18        1    1.0      0     1
15       rose   f     0  22-26        0    0.0     -1     3
16   leonardo   m     1  40-45        2    0.5      0     7
17     cinthy   f     0  45-55        0    0.0     -1     8
18        jim   m     3  10-18        2    0.5      1     1
19       paul   m     1  29-34        1    1.0      0     5

Output

             0          1    2
48   sabastein      smith  4.0
10        mary      allen  3.5
51   sabastein      katie  3.4
102      smith        jim  3.4
54   sabastein        jim  3.4
99       smith      katie  3.4
61        anna      katie  3.3
45   sabastein       anna  3.1
58        anna      smith  3.1
14        mary       rose  3.0
12        mary       mimi  3.0
84       allen     cinthy  3.0
98       smith       mimi  2.9
105   waterman      katie  2.9
11        mary      jolly  2.9
50   sabastein       mimi  2.9
40       emily      katie  2.9
52   sabastein       john  2.9
100      smith       john  2.9
90        rock      smith  2.8
47   sabastein       rock  2.8
0        jacob       mary  2.8
17        mary       paul  2.8
13        mary       john  2.8
119      katie        jim  2.8
116       mimi       paul  2.8
111       mimi       john  2.8
103      smith       paul  2.7
85       allen       paul  2.7
120      katie       paul  2.7
..         ...        ...  ...

This solution can be optimized further.
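
The sorted table still lists each person in many candidate pairs. One simple way to reduce it to unique pairs is a greedy pass: walk the rows from the highest score down and skip any row that contains a person who is already paired. The sketch below assumes the rows are available as (name, name, score) tuples; `pick_pairs` is a hypothetical helper, and the sample rows are taken from the output above.

```python
def pick_pairs(scored_pairs):
    """Greedily pick non-overlapping pairs from (name1, name2, score) rows."""
    taken = set()   # people who already have a partner
    chosen = []
    for a, b, score in sorted(scored_pairs, key=lambda r: r[2], reverse=True):
        if a in taken or b in taken:
            continue
        chosen.append((a, b, score))
        taken.update((a, b))
    return chosen

rows = [
    ('sabastein', 'smith', 4.0),
    ('mary', 'allen', 3.5),
    ('sabastein', 'katie', 3.4),
    ('smith', 'jim', 3.4),
    ('anna', 'katie', 3.3),
]
print(pick_pairs(rows))
```

A greedy pass is easy to follow, but it is not guaranteed to give the best overall set of pairs or to pair as many people as possible; a maximum weight matching algorithm (for example the Blossom algorithm) would be the more rigorous option.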

  • Thank you very much for this great approach to my problem and sorry for my late response. I have one more question I would like to ask you: 1) Let's say I have multiple age groups. In my case I have 9 age groups. I assume I have to change the equation. However, I am not sure what score I should give to each age group. Maybe you can help me here! – PParker Jan 07 '19 at 18:28
  • @PParker Thanks for the response. Please accept as right answer if it worked for you. For multiple age groups I need to know the relation between these nine age groups and their match priorities to define the equation. I will give it a try. – Anidhya Bhatnagar Jan 08 '19 at 12:22
  • Thank you for your help! I have nine age groups: 10-18; 18-22; 22-26; 26-29; 29-34; 34-40; 40-45; 45-55 and 55-75. People in the same age group match perfectly. The young age groups do not match very well with the older age groups. Similar age groups match a little bit better. There is no clearly defined condition. The meaning of "old" and "young" is relative. Background: The goal is to make pairs for free-time activities. I assume that people in the same or similar age groups share the same interests, so I want to take this into account in my code. – PParker Jan 08 '19 at 19:56
  • I will see whether we can use some sort of distance score between the groups: the farther apart the groups, the lower the score. At the same time, it should not bias the overall score or override the effect of the other properties. Let me think of a solution. – Anidhya Bhatnagar Jan 09 '19 at 04:58
  • @PParker I have updated the question with the additional information you provided on age groups, and I have also updated the answer and its explanation. The program now considers the distance between the two persons' age groups and scores them on the basis of that distance. Hope this helps. – Anidhya Bhatnagar Jan 13 '19 at 03:13
  • @AnidhyaBhatnagar Thank you so much for this extraordinary solution. It works great. Also thanks for your detailed explanations. Let me ask you some more general questions: 1.) Is it correct to say that (depending on the people) not everyone gets a partner? 2.) Based on the score, how would you choose the actual partner? I am confused, because for example sabastein has the highest score with smith (4.0) but smith has the highest score with jim (3.4). Maybe there is an explicit way to pick the pairs out of the final dataframe? – PParker Jan 13 '19 at 09:37
  • @PParker 1.) Yes, it is correct to say that, depending on the people and their features, not everyone will get a partner. 2.) Do not get confused: Sabastein has the highest score with Smith, and Smith also has the highest score with Sabastein, because whether you match Sabastein -> Smith or Smith -> Sabastein the score is the same 4.0. So when you pick out a pair, say (Sabastein & Smith), ignore all further pairs which contain either Sabastein or Smith. That way you will get unique pairs. – Anidhya Bhatnagar Jan 13 '19 at 11:45
  • Thank you very much for your answer. I just realized something new: we now have this very long dataframe with different pair combinations and scores. It turns out that one person can be the best match for multiple people. For example, Anna works best with Katie (3.3), but Christina also works best with Katie (2.4). There are many examples like this. So I guess the best approach is to write a function that picks the best pairs out of the dataframe. The goal is to find as many good pairs as possible. Maybe you can comment on that. How would you generally approach this? – PParker Jan 13 '19 at 16:56
  • @PParker Yes! the ultimate aim is to match and create the best pairs. Well this is a tricky one. Thinking upon this needs some time and working with some sample to reach the best solution. (like suppose A and B are best match, but A and C, B and D have some match, but C and D are definitely no match, here it would be best if A and C are matched B and D are matched.) I am going bit busy now a days - will look it on next weekend. Till then you can try taking out a pair & then discarding all the next record with either of them on any side, then picking a unique one and discarding and so on... – Anidhya Bhatnagar Jan 24 '19 at 18:33
1

This seems like a very interesting problem to me, and there are several ways to solve it. I will describe one approach and link you to another solution which I feel is related.

A possible approach is to create an additional column in your dataframe containing a 'code' that encodes the given attributes. For example:

    sex  food  age      kitchen   code
0   m    0     young    0         0y0
1   f    0     young    1         0y1
2   m    1     young    2         1y2
3   f    3     old      0         3o0
4   m    4     young    1         4y1
5   f    3     young    2         3y2

This 'code' is made up of abbreviations of your attributes. Since sex doesn't matter, the first character of the code stands for 'food', the second for 'age' and the third for 'kitchen'.

4y1 = food 4, age young, kitchen 1.
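
A minimal sketch of how such a code could be built, assuming the age label is abbreviated to its first letter as in the table above. The function name `make_code` and the `rows` list are illustrative.

```python
def make_code(food, age, kitchen):
    """Build the 3-character code: food digit, first letter of age, kitchen digit."""
    return f"{food}{age[0]}{kitchen}"

# (food, age, kitchen) for the six people in the question
rows = [(0, 'young', 0), (0, 'young', 1), (1, 'young', 2),
        (3, 'old', 0), (4, 'young', 1), (3, 'young', 2)]
codes = [make_code(*row) for row in rows]
print(codes)
```

This only works while each attribute fits into one character; with nine age groups the abbreviation would need a different scheme, for example the group's rank from 1 to 9.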

Based on these codes you can come up with a pattern. I recommend working with regular expressions for this. You can then write something like this:

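import re

# Patterns on the 3-character code: food, age, kitchen
haskitchen = r'\S\S1'
hasnokitchen = r'\S\S0'

# Join all codes into one string so that re.findall can scan them
# (assumes the 'code' column described above has been added to df)
codes = ' '.join(df['code'])

match_kitchen = re.findall(haskitchen, codes)
match_nokitchen = re.findall(hasnokitchen, codes)

kitchendict = {}
kitchendict["Has kitchen"] = match_kitchen
kitchendict["Has no kitchen"] = match_nokitchen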

Based on this, you can loop over the entries and put them together however you want. There may be a much easier solution, and I haven't tested the code; this is just what came to mind. One thing is for sure: use regular expressions for matching.

Mowgli
1

Well, let's test for the kitchen.

kitchen = df['kitchen'].tolist()

for I in kitchen:
    if I != 0:
        print("Kitchen Found")
    else:
        print("No kitchen")

Okay, now that we have found the people who have a kitchen, let's find, for each person without a kitchen, someone with a kitchen and similar food preferences. Let's create a variable x that counts how many people have a kitchen, and use each person's index to report the matches.

kitchen = df['kitchen'].tolist()
food = df['food'].tolist()

x = 0  # number of people who have a kitchen
for person, (I, A) in enumerate(zip(kitchen, food)):
    if I != 0:
        x = x + 1
        print("Kitchen Found")
        continue
    print("No kitchen")
    # Compare this person with everyone who has a kitchen
    for other, (K, J) in enumerate(zip(kitchen, food)):
        if K == 0:
            continue
        # Same preference, either side does not care (food=1),
        # or both are vegan/vegetarian (food=2 or food=3)
        if A == J or A == 1 or J == 1 or (A in (2, 3) and J in (2, 3)):
            print("food match found for person " + str(person)
                  + ": person " + str(other))

I am currently working on the age part and adjusting some things.

Dodge