I am new to NumPy/pandas and vectorized computation. I am working on a data task with two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distance to every place in dataset 2, but I only need the count of places in dataset 2 whose distance is less than the value of variable A. Both datasets are very large, so I need vectorized operations to speed up the computation.
For example, my dataset1 may look like this:
id lon lat varA
1 20.11 19.88 100
2 20.87 18.65 90
3 18.99 20.75 120
and my dataset2 may look like this:
placeid lon lat
a 18.75 20.77
b 19.77 22.56
c 20.86 23.76
d 17.55 20.74
Then for id == 1 in dataset1, I would like to calculate its distances to all four points (a, b, c, d) in dataset2 and get a count of how many of those distances are less than the corresponding value of varA. For example, if the four distances are 90, 70, 120, and 110 and varA is 100, then the count should be 2.
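To make the counting step concrete, here is a minimal sketch that uses only the made-up distances from the example above. The variable names are illustrative and not part of my real code.

```python
import numpy as np

# Illustrative distances from the point with id == 1 to places a, b, c, d
distances = np.array([90, 70, 120, 110])
var_a = 100

# Comparing an array with a scalar yields a boolean mask;
# summing the mask counts the True entries.
within = distances < var_a
count = int(within.sum())
print(count)  # 2
```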
I already have a vectorized function that calculates the distance between two pairs of coordinates. Assuming this function (haversine) is properly implemented, I have the following code:
dataset1['count'] = dataset1.apply(
    lambda x: haversine(x['lon'], x['lat'], dataset2['lon'], dataset2['lat']).shape[0],
    axis=1)
However, this gives the total number of rows in dataset2 for every place, not the number of rows that satisfy my condition.
Could anyone point out how to make the code work?
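For reference, here is a rough sketch of the behaviour being asked for, using the sample data above. It rests on several assumptions: the haversine below is a stand-in written for this illustration (not the actual function mentioned above), it takes degrees in the order (lon1, lat1, lon2, lat2), and it returns kilometres, so varA is assumed to be in kilometres as well. The first variant keeps the apply call and replaces .shape[0] with a comparison and a sum; the second variant uses NumPy broadcasting to build the full distance matrix at once.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumption: varA is expressed in kilometres


def haversine(lon1, lat1, lon2, lat2):
    """Stand-in great-circle distance in km; inputs in degrees, broadcastable."""
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lon1, lat1, lon2, lat2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


dataset1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "lon": [20.11, 20.87, 18.99],
        "lat": [19.88, 18.65, 20.75],
        "varA": [100, 90, 120],
    }
)
dataset2 = pd.DataFrame(
    {
        "placeid": ["a", "b", "c", "d"],
        "lon": [18.75, 19.77, 20.86, 17.55],
        "lat": [20.77, 22.56, 23.76, 20.74],
    }
)

# Variant 1: keep the row-wise apply, but count the distances below varA
# instead of taking the length of the result.
dataset1["count"] = dataset1.apply(
    lambda x: (
        haversine(x["lon"], x["lat"], dataset2["lon"], dataset2["lat"]) < x["varA"]
    ).sum(),
    axis=1,
)

# Variant 2: broadcasting. Column vectors from dataset1 against row vectors
# from dataset2 give a (len(dataset1), len(dataset2)) distance matrix.
dist = haversine(
    dataset1["lon"].to_numpy()[:, None],
    dataset1["lat"].to_numpy()[:, None],
    dataset2["lon"].to_numpy()[None, :],
    dataset2["lat"].to_numpy()[None, :],
)
counts = (dist < dataset1["varA"].to_numpy()[:, None]).sum(axis=1)
```

The broadcasting variant avoids the Python-level loop inside apply, but it materializes the whole distance matrix in memory. With two very large datasets that matrix may not fit, in which case processing dataset1 in chunks of rows would be one way to bound memory use.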