1

TL;DR: I understand that .apply() is slow in pandas. However, I have a function that acts on indexes, which I cannot figure out how to vectorize. I want this function to act on two sets of parameters (500,000 and 1,500 long, respectively). I want to produce a dataframe with the first parameter as the row index, the second parameter as column names, and the cells containing the function's output for that particular row and column. As it stands it looks like the code will take several days to run. More details and minimal reproducible example below:


INPUT DATA:

I have a series of unique student IDs, which is 500,000 students long. I have a df (exam_score_df) that is indexed by these student IDs, containing each student's corresponding scores in math, language, history, and science.

I also have a series of school codes (each school code corresponds to a school), which is 1,500 schools long. I have a df (school_weight_df) that is indexed by school codes, containing the school's weights in math, language, history, and science that it uses to calculate a student's scores. Each row also contains a 'Y' or 'N' indexed 'Alternative_Score' because some schools allow you to take the best subject score between history and science to calculate your overall score.

FUNCTION I WROTE TO BE VECTORIZED:

def calc_score(student_ID, program_code):
'''
For a given student and program, returns students score for that program.
'''
if school_weight_df.loc[program_code]['Alternative_Score'] == 'N':
    
    return np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST', 'SCI']]), 
                  np.array(school_weight_df.loc[program_code][['%LANG','%MAT','%HIST','%SCI']]))

elif school_weight_df.loc[program_code]['Alternative_Score'] == 'Y':

    history_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST']]), 
                           np.array(school_weight_df.loc[program_code][['%LANG','%MAT','%HIST']]))
    
    science_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'SCI']]), 
                           np.array(school_weight_df.loc[program_code][['%LANG','%MAT','%SCI']]))

    return max(history_score, science_score)

EXAMPLE DFs:

Here are example dfs for exam_score_df and school_weight_df:

student_data = [[3, 620, 688, 638, 688], [5, 534, 602, 606, 700], [9, 487, 611, 477, 578]]
exam_score_df = pd.DataFrame(student_data, columns = ['student_ID', 'LANG', 'MAT', 'HIST', 'SCI'])
exam_score_df.set_index('student_ID')

program_data = [[101, 20, 30, 25, 25, 'N'], [102, 40, 10, 50, 50, 'Y']]
school_weight_df = pd.DataFrame(program_data, columns = ['program_code', '%LANG','%MAT','%HIST','%SCI', 'Alternative_Score'])
school_weight_df.set_index('program_code', inplace = True)

Here's the series which are used to index the code below:

series_student_IDs = pd.Series(exam_score_df.index, inplace = True)
series_program_codes = pd.Series(school_weight_df.index, inplace = True)

CODE TO CREATE DF USING FUNCTION:

To create the df of all of the students' scores at each program, I used nested .apply()'s:

new_df = pd.DataFrame(series_student_IDs.apply(lambda x: series_program_codes.apply(lambda y: calc_score(x, y))))

I've already read several primers on optimizing code in pandas, including the very well-written Guide by Sofia Heisler. My primary concern, and reason for why I can't figure out how to vectorize this code, is that my function needs to act on indexes. I also have a secondary concern that, even if I do vectorize, there is this problem with np.dot on large matrices for which I would want to loop anyway.

Thanks for all the help! I have only been coding for a few months, so all the helpful comments are really appreciated.

shaha
  • 123
  • 9
  • 1
    https://stackoverflow.com/help/minimal-reproducible-example – Paul H Jan 17 '21 at 23:26
  • 1
    Could you explain at a very high level what are you trying to achieve? I see you have added that details but its mixed up with HOW you are trying to do it and thus confusing. Without any discussion on the how part, can you simply talk about the goal? – Akshay Sehgal Jan 18 '21 at 00:25
  • 1
    Goal: I want a df with rows for each student, and columns for each school, and cells with the score of a student in that school. Does that help? – shaha Jan 18 '21 at 00:34
  • It might be easier to 'vectorize' if you can split the rows into 2 groups based on the `school_weight_df.loc[program_code]['Alternative_Score']` value. `group_by` might do it for you. Then you might be able to perform a multi-row `dot`. At this point I wouldn't worry about the large_matrix problem you link. – hpaulj Jan 18 '21 at 00:41
  • A batch-wise approach for a batch of students with vectorized computation is what you can do. You don't need to do separate dot products. You can separate the multiply and reduce part and compute the reduce part separately for the 2 conditions you have with the Alternate score. Check my solution and explanation. @hpaulj do review and suggest as well, looking forward to your suggestions, thanks! – Akshay Sehgal Jan 18 '21 at 01:53

1 Answers1

1

Apply = bad, Double Apply = very bad

If you are going Numpy in the function, why not go Numpy all the way? You would still prefer a batch-wise approach since the overall matrix would take tons of memory. Check the following approach.

Each iteration took me 2.05 seconds on a batch of 5000 students on a low-end macbook pro. This means for 500,000 students, you can expect 200 seconds approx, which is not half bad.

I ran the following on 100000 students and 1500 schools which took me a total of 30-40 seconds approx.

  1. First I created a dummy set data: Exam scores (100,000 students, 4 scores), school weights (1500 schools, 4 weights) AND a boolean flag for which school has alternative as Y or N, Y==True, N==False

  2. Next, for a batch of 5000 students, I simply calculate the element-wise product of each of the 4 subjects between the 2 matrices using np.einsum. This gives me (5000,4) * (1500,4) -> (1500,5000,4). Consider this as the first part of the dot product (without the sum).

  3. The reason I do this is because this is a necessary step for both your conditions N or Y.

  4. Next, FOR N: I simply filter the above matrix based on alt_flag, reduce it (sum) over last axis and transpose to get (5000, 766), where 766 are the number of schools with alternative == N

  5. FOR Y:, I filter based on alt_flag and then I calculate the sum of the first 2 subjects (because they are common) and add those to the 3rd and 4th subject separately, take a max and return that as my final score. Post that a Transpose. This gives me (5000, 734).

  6. I do this for each batch of 5000, until I have appended all the batches and then simply np.vstack to get the final tables (100000, 766) and (100000, 734).

  7. Now I can simply stack these over axis=0 to get (100000, 1500) but if I want to map them to the IDs (student, schools), it would be easier to do it separately using pd.DataFrame(data, columns=list_of_schools_alt_Y, index=list_of_student_ids and then combine them. Read the last step for you.

  8. Last step is for you to perform since I don't have the complete dataset. Since the order of the indexes is retained through batch-wise vectorization, you can now simply map the 766 school IDs with N, 734 school IDs with Y, and 100000 student IDs, in the order they occur in your main dataset. Then simply append the 2 data frames to create a final (massive) dataframe.

  9. NOTE: you will have to change the 100000 to 500000 in the for loop, don't forget!!

import numpy as np
import pandas as pd
from tqdm import notebook

exam_scores = np.random.randint(300,800,(100000,4))
school_weights = np.random.randint(10,50,(1500,4))
alt_flag = np.random.randint(0,2,(1500,), dtype=bool) #0 for N, 1 for Y
batch = 5000

n_alts = []
y_alts = []

for i in notebook.tqdm(range(0,100000,batch)):
    scores = np.einsum('ij,kj->kij', exam_scores[i:i+batch], school_weights) #(1500,5000,4)

    #Alternative == N
    n_alt = scores[~alt_flag].sum(-1).T   #(766, 5000, 4) -> (5000, 766)

    #Alternative == Y
    lm = scores[alt_flag,:,:2].sum(-1)    #(734, 5000, 2) -> (734, 5000); lang+math
    h = scores[alt_flag,:,2]              #(734, 5000); history
    s = scores[alt_flag,:,3]              #(734, 5000); science
    y_alt = np.maximum(lm+h, lm+s).T      #(5000, 734)
    
    n_alts.append(n_alt)
    y_alts.append(y_alt)

final_n_alts = np.vstack(n_alts)
final_y_alts = np.vstack(y_alts)

print(final_n_alts.shape)
print(final_y_alts.shape)

enter image description here

Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51