PROBLEM: I have a dataframe showing which assignments students chose to do and what grades they got on them. I am trying to determine which subsets of assignments were done by the most students and the total points earned on them. The method I'm using is very slow, so I'm wondering what the fastest way is.
My data has this structure:
STUDENT | ASSIGNMENT1 | ASSIGNMENT2 | ASSIGNMENT3 | ... | ASSIGNMENT20 |
---|---|---|---|---|---|
Student1 | 50 | 75 | 100 | ... | 50 |
Student2 | 75 | 25 | NaN | ... | NaN |
... | | | | | |
Student2000 | 100 | 50 | NaN | ... | 50 |
TARGET OUTPUT: For every possible combination of assignments, I'm trying to get the number of completions and the sum of total points earned on each individual assignment by the subset of students who completed that exact assignment combo:
ASSIGNMENT_COMBO | NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO | ASSIGNMENT1 TOTAL POINTS | ASSIGNMENT2 TOTAL POINTS | ASSIGNMENT3 TOTAL POINTS | ... | ASSIGNMENT20 TOTAL POINTS |
---|---|---|---|---|---|---|
Assignment 1, Assignment 2 | 900 | 5000 | 400 | NaN | ... | NaN |
Assignment 1, Assignment 2, Assignment 3 | 100 | 3000 | 500 | ... | NaN | |
Assignment 2, Assignment 3 | 750 | NaN | 7000 | 750 | ... | NaN |
... | | | | | | |
All possible combos, including any number of assignments |
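
To make the target concrete, here is a minimal sketch on made-up toy data (4 students, 3 assignments). It assumes NaN means the assignment was not attempted, and the helper name `combo_stats` is hypothetical. It groups students by the exact set of assignments they completed, so it only produces rows for combos that actually occur in the data:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: NaN is assumed to mean "not attempted"
toy = pd.DataFrame(
    {
        "ASSIGNMENT1": [50, 75, 100, 25],
        "ASSIGNMENT2": [75, 25, 50, np.nan],
        "ASSIGNMENT3": [100, np.nan, np.nan, np.nan],
    },
    index=["Student1", "Student2", "Student3", "Student4"],
)


def combo_stats(grades: pd.DataFrame) -> pd.DataFrame:
    """Count students and sum points per exact set of completed assignments."""
    done = grades.notna()
    # Label each student with the exact combo of assignments they completed
    labels = done.apply(lambda row: ", ".join(row.index[row.to_numpy()]), axis=1)
    labels.name = "ASSIGNMENT_COMBO"
    counts = labels.groupby(labels).size().rename("NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO")
    # min_count=1 leaves the total as NaN when nobody in the group has a grade
    totals = grades.groupby(labels).sum(min_count=1)
    return pd.concat([counts, totals], axis=1)


result = combo_stats(toy)
print(result)
```

On this toy frame, the two students who did exactly Assignment 1 and Assignment 2 form one row with a count of 2, and the ASSIGNMENT3 total for that row stays NaN.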
WHAT I'VE TRIED: First, I'm using itertools to make my assignment combos and then iterating through the dataframe to classify each student by what combos of assignments they completed:
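import itertools

# Every non-empty combination of assignment names, of any size
all_combos = itertools.chain.from_iterable(
    itertools.combinations(list_of_assignment_names, r)
    for r in range(1, len(list_of_assignment_names) + 1)
)
for combo in all_combos:
    ifor = str(combo)
    for i, row in starting_data.iterrows():
        # 'yes' only if the student has a positive grade on every assignment in the combo
        ifor_val = 'yes' if all(row[str(item)] > 0 for item in combo) else 'no'
        starting_data.at[i, ifor] = ifor_val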
Then, I make a second dataframe (assignmentcombostats) that has each combo as a row to count up the number of students who did each combo:
numberofstudents = []
for combo in assignmentcombostats['combo']:
    column = str(combo)
    # Count the students flagged 'yes' for this combo
    number = len(starting_data[starting_data[column] == 'yes'])
    numberofstudents.append(number)
assignmentcombostats['numberofstudents'] = numberofstudents
This works, but it is very slow.
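
For a sense of scale, here is a rough back-of-the-envelope count. It assumes 20 assignments and 2000 students, as in the tables above, and that every non-empty subset of assignments is checked against every student. The variable names are only for illustration:

```python
from math import comb

n_assignments = 20
n_students = 2000

# Non-empty subsets of 20 assignments: sum of C(20, r) for r = 1..20, i.e. 2**20 - 1
n_combos = sum(comb(n_assignments, r) for r in range(1, n_assignments + 1))

print(n_combos)               # 1048575
print(n_combos * n_students)  # 2097150000 student/combo checks in the nested loop
```

With only 2000 students, at most 2000 distinct exact combos can actually appear in the data, so nearly all of those checks are for combos nobody completed.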
RESOURCES: I've looked at a few resources -