I have one data frame of around 2.5 million records. I also have a dictionary of around 10,000 records. Now, I have a scenario, where I have to loop through every record of the data frame and compare values of one column which is a list of values with another list which is a part of the dictionary and get the first two values out of it(except the values which current list has).
Now, this is below dummy data.
# originally this dictionary has around 10,000 records
temp = {
0: {
'AT_1': ['AN_1_2', 'AN_1_0', 'AN_1_1', 'AN_1_3'],
'AT_2': ['AN_2_1'],
'AT_3': ['AN_3_0', 'AN_3_1']
},
1: {
'AT_1': ['AN_1_1', 'AN_1_0'],
'AT_2': ['AN_2_0', 'AN_2_1'],
'AT_3': ['AN_3_1']
}
}
# originally this dataframe has around 2.5 million records
Name Group Points Subpoints Additional_Points
x 0.0 AT_1 [AN_1_0] NaN
y 1.0 AT_2 [AN_2_0, AN_2_1] NaN
z 1.0 AT_1 [AN_1_1] NaN
w 0.0 AT_1 NaN NaN
And, output should be like this:
# compare each row of value with dictionary on basis of 'Group' and 'Points', then take the first two values except the values present in 'Subpoints'.
Name Group Points Subpoints Additional_Points
x 0.0 AT_1 [AN_1_0] [AN_1_2, AN_1_1]
y 1.0 AT_2 [AN_2_0, AN_2_1] []
z 1.0 AT_1 [AN_1_1] [AN_1_0]
w 0.0 AT_1 NaN NaN
Now, this is piece of code I am using, which is working fine. But it is very much slow for 2.5 million records.
people['Additional_Points'] = people.apply(lambda x: [i for i in temp[x['Group']][x['Points']] if i not in x['Subpoints']][:2] if x.notna()['Subpoints'] else np.nan, axis=1)
Kindly let me know if there is any way to make this whole operation faster.