I have a Spark dataframe (prof_student_df) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a "score" (so there are 16 rows per time frame). For each time frame, I need to find the one-to-one pairing between professors and students that maximizes the overall score. Each professor can only be matched with one student for a single time frame.
For example, here are the pairings/scores for one time frame.
+------------+--------------+------------+-------+----------+
|    time    | professor_id | student_id | score | is_match |
+------------+--------------+------------+-------+----------+
| 1596048041 | p1           | s1         | 0.7   | FALSE    |
| 1596048041 | p1           | s2         | 0.5   | TRUE     |
| 1596048041 | p1           | s3         | 0.3   | FALSE    |
| 1596048041 | p1           | s4         | 0.2   | FALSE    |
| 1596048041 | p2           | s1         | 0.9   | TRUE     |
| 1596048041 | p2           | s2         | 0.1   | FALSE    |
| 1596048041 | p2           | s3         | 0.15  | FALSE    |
| 1596048041 | p2           | s4         | 0.2   | FALSE    |
| 1596048041 | p3           | s1         | 0.2   | FALSE    |
| 1596048041 | p3           | s2         | 0.3   | FALSE    |
| 1596048041 | p3           | s3         | 0.4   | FALSE    |
| 1596048041 | p3           | s4         | 0.8   | TRUE     |
| 1596048041 | p4           | s1         | 0.2   | FALSE    |
| 1596048041 | p4           | s2         | 0.3   | FALSE    |
| 1596048041 | p4           | s3         | 0.35  | TRUE     |
| 1596048041 | p4           | s4         | 0.4   | FALSE    |
+------------+--------------+------------+-------+----------+
The goal is to get this is_match column. It can be a boolean, a 0/1 bit, or whatever works.
In the above example, p1 matched with s2, p2 with s1, p3 with s4, and p4 with s3 because that is the combination that maximizes the total score (yielding 2.55). There is one edge case: it is possible to have fewer than 4 professors or students for a given time frame. If there are 4 professors and 3 students, then 1 professor would be without a pairing and all of their is_match values would be false. Similarly, if there are 3 professors and 4 students, 1 student would be without a pairing and all of their is_match values would be false.
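For a single time frame, this is the classic assignment problem, which `scipy.optimize.linear_sum_assignment` (Hungarian-style algorithm) solves directly, so no hand-rolled recursion is needed. A minimal sketch on the example time frame above, with the score matrix transcribed from the table:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

professors = ["p1", "p2", "p3", "p4"]
students = ["s1", "s2", "s3", "s4"]

# scores[i][j] = score for professors[i] paired with students[j],
# taken from the example time frame above.
scores = np.array([
    [0.7, 0.5, 0.30, 0.2],
    [0.9, 0.1, 0.15, 0.2],
    [0.2, 0.3, 0.40, 0.8],
    [0.2, 0.3, 0.35, 0.4],
])

# maximize=True finds the one-to-one pairing with the highest total score.
row_ind, col_ind = linear_sum_assignment(scores, maximize=True)

pairs = {professors[i]: students[j] for i, j in zip(row_ind, col_ind)}
total = scores[row_ind, col_ind].sum()
print(pairs)           # {'p1': 's2', 'p2': 's1', 'p3': 's4', 'p4': 's3'}
print(round(total, 2)) # 2.55
```

This reproduces the pairing from the table (p1-s2, p2-s1, p3-s4, p4-s3) and the 2.55 total.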
Does anyone know how I might accomplish this? I am thinking I would partition or group by time and then feed the data into some UDF that spits out the pairings, and then maybe I would have to join that back to the original rows (although I am not sure). I am trying to implement this logic in PySpark and can use Spark SQL or PySpark.
Ideally, I would like this to be as efficient as possible, as there will be millions of rows. I mentioned a recursive algorithm because this is a traditional recursive-type problem, but if there is a quicker solution that doesn't use recursion, I am open to that.
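One way to wire the group-by-time idea together without a join back is a grouped pandas UDF: each time frame arrives as a pandas DataFrame and is returned with is_match filled in. A sketch of the per-group function, assuming the column names from the example (`match_one_frame` is a name I made up; the Spark call is shown in the trailing comment, since `groupBy(...).applyInPandas(...)` requires Spark 3.0+):

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment

def match_one_frame(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fill in is_match for one time frame (one group)."""
    # Pivot into a professors-by-students score matrix.
    # Missing professor/student pairs, if any, score 0.
    mat = pdf.pivot_table(index="professor_id", columns="student_id",
                          values="score", fill_value=0.0)
    row_ind, col_ind = linear_sum_assignment(mat.to_numpy(), maximize=True)
    matched = set(zip(mat.index[row_ind], mat.columns[col_ind]))
    pdf = pdf.copy()
    pdf["is_match"] = [
        (p, s) in matched
        for p, s in zip(pdf["professor_id"], pdf["student_id"])
    ]
    return pdf

# In PySpark (3.0+), applied per time frame:
# result = prof_student_df.groupBy("time").applyInPandas(
#     match_one_frame,
#     schema="time long, professor_id string, student_id string, "
#            "score double, is_match boolean",
# )
```

Since each group is at most 16 rows, the per-group cost is trivial; Spark parallelizes across time frames, which is where the millions of rows come from.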
Many thanks. I am new to Spark and a little stumped on how to do this.
EDIT: clarifying the question, as I realize my example did not specify this.
For a single day, there will be up to 14 professors and 14 students to choose from. I am just looking at one day at a time, which is why I didn't include the date in the dataframe. At any one time frame, there are at most 4 professors and 4 students. This dataframe just shows one time frame, but for the next time frame it is possible that the 4 professors are p5, p1, p7, p9 or something like that. The students might still be s1, s2, s3, s4.
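On the fewer-than-4 edge case: `linear_sum_assignment` already accepts rectangular matrices, so with 3 professors and 4 students it returns 3 pairs and the leftover student simply never appears in the result (so all of their is_match rows stay false). A small sketch with hypothetical scores, just to show the shape of the output:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 3 professors x 4 students: one student will be left unmatched.
# Scores here are made up for illustration.
scores = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.4, 0.8, 0.1, 0.2],
    [0.3, 0.2, 0.1, 0.7],
])
row_ind, col_ind = linear_sum_assignment(scores, maximize=True)
print(list(zip(row_ind, col_ind)))  # three pairs; student index 2 is unmatched
```

The same holds transposed (3 students, 4 professors), so the per-frame logic needs no special casing for missing people.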