TL;DR: I understand that .apply() is slow in pandas. However, I have a function that acts on indexes, which I cannot figure out how to vectorize. I want this function to act on two sets of parameters (500,000 and 1,500 long, respectively). I want to produce a dataframe with the first parameter as the row index, the second parameter as column names, and the cells containing the function's output for that particular row and column. As it stands it looks like the code will take several days to run. More details and minimal reproducible example below:
INPUT DATA:
I have a series of unique student IDs, which is 500,000 students long. I have a df (exam_score_df) that is indexed by these student IDs, containing each student's corresponding scores in math, language, history, and science.
I also have a series of school codes (each school code corresponds to a school), which is 1,500 schools long. I have a df (school_weight_df) that is indexed by school codes, containing the school's weights in math, language, history, and science that it uses to calculate a student's scores. Each row also contains a 'Y' or 'N' indexed 'Alternative_Score' because some schools allow you to take the best subject score between history and science to calculate your overall score.
FUNCTION I WROTE TO BE VECTORIZED:
def calc_score(student_ID, program_code):
'''
For a given student and program, returns students score for that program.
'''
if school_weight_df.loc[program_code]['Alternative_Score'] == 'N':
return np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST', 'SCI']]),
np.array(school_weight_df.loc[program_code][['%LANG','%MAT','%HIST','%SCI']]))
elif school_weight_df.loc[program_code]['Alternative_Score'] == 'Y':
history_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST']]),
np.array(school_weight_df.loc[program_code][['%LANG','%MAT','%HIST']]))
science_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'SCI']]),
np.array(school_weight_df.loc[program_code][['%LANG','%MAT','%SCI']]))
return max(history_score, science_score)
EXAMPLE DFs:
Here are example dfs for exam_score_df and school_weight_df:
student_data = [[3, 620, 688, 638, 688], [5, 534, 602, 606, 700], [9, 487, 611, 477, 578]]
exam_score_df = pd.DataFrame(student_data, columns = ['student_ID', 'LANG', 'MAT', 'HIST', 'SCI'])
exam_score_df.set_index('student_ID')
program_data = [[101, 20, 30, 25, 25, 'N'], [102, 40, 10, 50, 50, 'Y']]
school_weight_df = pd.DataFrame(program_data, columns = ['program_code', '%LANG','%MAT','%HIST','%SCI', 'Alternative_Score'])
school_weight_df.set_index('program_code', inplace = True)
Here's the series which are used to index the code below:
series_student_IDs = pd.Series(exam_score_df.index, inplace = True)
series_program_codes = pd.Series(school_weight_df.index, inplace = True)
CODE TO CREATE DF USING FUNCTION:
To create the df of all of the students' scores at each program, I used nested .apply()'s:
new_df = pd.DataFrame(series_student_IDs.apply(lambda x: series_program_codes.apply(lambda y: calc_score(x, y))))
I've already read several primers on optimizing code in pandas, including the very well-written Guide by Sofia Heisler. My primary concern, and reason for why I can't figure out how to vectorize this code, is that my function needs to act on indexes. I also have a secondary concern that, even if I do vectorize, there is this problem with np.dot on large matrices for which I would want to loop anyway.
Thanks for all the help! I have only been coding for a few months, so all the helpful comments are really appreciated.