I have 2 quite large dataframes (>>100,000,000 of rows). I would like to calculate a column sum_distance in first dataframe based on data in second dataframe
first dataframe df1 (simplified and should be readable with
pd.read_clipboard(sep='\s\s+')
):
vehicle month_start month_end
0 577-1 201201 201203
1 577-1 201203 201207
2 577-1 201203 201207
3 577-1 201203 201207
4 577-1 201204 201207
5 577-1 201205 201207
6 577-1 201206 201207
7 577-1 201206 201209
8 577-1 201207 201209
9 577-1 201207 201210
second dataframe df2:
vehicle Year Month Distance
0 577-1 2012 1 13680.0
1 577-1 2012 2 13295.0
2 577-1 2012 3 13010.0
3 577-1 2012 4 16267.0
4 577-1 2012 5 11825.0
5 577-1 2012 6 9739.0
6 577-1 2012 7 14290.0
7 577-1 2012 8 14117.0
8 577-1 2012 9 10687.0
9 577-1 2012 10 14504.0
I am looking for efficient way how to calculate column in df1 "sum_distance" (sum of distance for rows in df2 where year/month is between month_start and month_end), if possible without looping in df1. What would be the most efficient way?
expected output df1
vehicle month_start month_end sum_distance
0 577-1 201201 201203 26975.0
1 577-1 201203 201207 50841.0
2 577-1 201203 201207 50841.0
3 577-1 201203 201207 50841.0
4 577-1 201204 201207 37831.0
5 577-1 201205 201207 21564.0
6 577-1 201206 201207 9739.0
7 577-1 201206 201209 38146.0
8 577-1 201207 201209 28407.0
9 577-1 201207 201210 39094.0