
I am new to numpy/pandas and vectorized computation. I am doing a data task where I have two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distances to all the places in dataset 2, but I only need a count of the places in dataset 2 whose distance is less than the value of variable A. Note also that both datasets are very large, so I need vectorized operations to expedite the computation.

For example, my dataset1 may look like below:

id  lon    lat    varA
1   20.11  19.88  100
2   20.87  18.65   90
3   18.99  20.75  120

and my dataset2 may look like below:

placeid  lon    lat
a        18.75  20.77
b        19.77  22.56
c        20.86  23.76
d        17.55  20.74

Then for id == 1 in dataset1, I would like to calculate its distances to all four points (a, b, c, d) in dataset2, and I would like a count of how many of those distances are less than the corresponding value of varA. For example, if the four distances calculated are 90, 70, 120, 110 and varA is 100, then the value should be 2.

I already have a vectorized function to calculate the distance between two pairs of coordinates. Assuming the function haversine(lon1, lat1, lon2, lat2) is properly implemented, I have the following code.

dataset2['count'] = dataset1.apply(
    lambda x: haversine(x['lon'], x['lat'], dataset2['lon'], dataset2['lat']).shape[0],
    axis=1)

However, this gives the total number of rows in dataset2, not the number of rows that satisfy my requirement.

Would anyone be able to show me how to make the code work?

macintosh81
    as in vectorizing Haversine calculations? see the Related links to the right of your post... ie https://stackoverflow.com/questions/34502254/vectorizing-haversine-distance-calculation-in-python?rq=1 – NaN Aug 21 '17 at 21:23
  • can you post the haversine function you're using? – DJK Aug 21 '17 at 22:52
  • @macintosh81 If my answer was useful, please consider accepting/up-voting it. – Alz Aug 29 '17 at 11:40

3 Answers


If you can project the coordinates to a local projection (e.g. UTM), which is pretty straightforward with pyproj and generally more favorable than lon/lat for measurement, then there is a much, much faster way using scipy.spatial. Neither df['something'] = df.apply(...) nor np.vectorize() is truly vectorized; under the hood, they use looping.

ds1
    id  lon lat varA
0   1   20.11   19.88   100
1   2   20.87   18.65   90
2   3   18.99   20.75   120

ds2
    placeid lon lat
0   a   18.75   20.77
1   b   19.77   22.56
2   c   20.86   23.76
3   d   17.55   20.74


from scipy.spatial import distance

# get the coordinates of each set of points as a numpy array
coords_a = ds1.values[:,(1,2)]
coords_b = ds2.values[:, (1,2)]
coords_a
#out: array([[ 20.11,  19.88],
#       [ 20.87,  18.65],
#       [ 18.99,  20.75]])

distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074,  2.70148108,  3.95182236,  2.70059253],
#       [ 2.99813275,  4.06178532,  5.11000978,  3.92307278],
#       [ 0.24083189,  1.97091349,  3.54358575,  1.44003472]])

distances now holds the distance between every pair of points. coords_a.shape is (3, 2) and coords_b.shape is (4, 2), so the result has shape (3, 4). The default metric for scipy.spatial.distance.cdist is euclidean, but there are other metrics as well. For the sake of this example, let's assume vara is:

vara = np.array([2,4.5,2])

(instead of 100, 90, 120). We need to identify which values in distances are smaller than 2 in row one, smaller than 4.5 in row two, and so on. One way to solve this is to subtract vara from the corresponding row of distances (note that we must first resize vara into a column vector):

vara.resize(3, 1)        # shape (3,) -> (3, 1) so it broadcasts row-wise
res = distances - vara
#out: array([[-0.37466926,  0.70148108,  1.95182236,  0.70059253],
#       [-1.50186725, -0.43821468,  0.61000978, -0.57692722],
#       [-1.75916811, -0.02908651,  1.54358575, -0.55996528]])

Then setting positive values to zero and making negative values positive will give us the final array:

res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926,  0.        ,  0.        ,  0.        ],
#            [ 1.50186725,  0.43821468,  0.        ,  0.57692722],
#            [ 1.75916811,  0.02908651,  0.        ,  0.55996528]])

Now, to sum over each row:

sum_ = res.sum(axis=1)
#out:  array([ 0.37466926,  2.51700915,  2.34821989])

and to count the items in each row:

count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
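
As a side note, the same per-row count can be had in one step with a boolean comparison, skipping the subtract/zero/absolute dance; a minimal sketch, assuming vara still has the (3, 1) shape from the resize above:

count = (distances < vara).sum(axis=1)
#out: array([1, 3, 3])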

This is a fully vectorized (custom) solution which you can tweak to your liking and should accommodate any level of complexity. Yet another solution is cKDTree. The code below is from the scipy documentation; it should be fairly easy to adapt to your problem, but in case you need assistance, don't hesitate to ask.

from scipy import spatial

x, y = np.mgrid[0:4, 0:4]
points = list(zip(x.ravel(), y.ravel()))  # cKDTree needs a sequence, not a lazy zip
tree = spatial.cKDTree(points)
tree.query_ball_point([2, 0], 1)
#out: [4, 8, 9, 12]

query_ball_point() finds all points within distance r of point(s) x, and it is amazingly fast.
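
To sketch how query_ball_point() maps onto the original problem: build the tree on dataset2's projected coordinates and query with dataset1's. Assume here that coords_a and coords_b are projected to metres and radii is a hypothetical array holding each dataset1 row's varA threshold in the same units; recent scipy versions accept one radius per query point, while older ones need a per-row loop:

tree = spatial.cKDTree(coords_b)
neighbours = tree.query_ball_point(coords_a, r=radii)  # list of neighbour-index lists
counts = np.array([len(n) for n in neighbours])        # one count per dataset1 row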

One final note: don't use these euclidean-based algorithms on raw lon/lat input, particularly if your area of interest is far from the equator, because the error can get huge.

UPDATE:

To project your coordinates, you need to convert from WGS84 (lon/lat) to the appropriate UTM zone. To find out which UTM zone you should project to, use epsg.io.

import pyproj

lon = -122.67598
lat = 45.52168
WGS84 = "+init=EPSG:4326"      # source CRS: lon/lat
EPSG3740 = "+init=EPSG:3740"   # target CRS: a UTM zone covering this point
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)

Proj_to_EPSG3740(lon, lat)
# out: (525304.9265963673, 5040956.147893889)

You can then use Proj_to_... with df.apply() to project the columns of df.
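
Alternatively, pyproj.Proj accepts whole arrays, so a sketch that projects the columns directly, without apply (assuming the ds2 frame from above and that EPSG:3740 actually fits your data):

xs, ys = Proj_to_EPSG3740(ds2['lon'].values, ds2['lat'].values)
coords_b = np.column_stack([xs, ys])  # projected coordinates in metres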

Alz
  • 755
  • 6
  • 11
  • Hey, thanks so much for the input! It is very helpful. However, my raw data only has lon/lat measurements for all the points. In this case, can I use the two methods you suggest? Is there a way for me to project the coordinates to a UTM projection schema? – macintosh81 Aug 22 '17 at 02:37
  • You can easily project lon/lat to utm using `pyproj`. I will update my answer to explain how. here is some useful material about it: [1](https://ocefpaf.github.io/python4oceanographers/blog/2013/12/16/utm/), [2](https://stackoverflow.com/questions/6778288/lat-lon-to-utm-to-lat-lon-is-extremely-flawed-how-come). – Alz Aug 22 '17 at 11:48
  • Hey, this is really helpful! One last question, if I use projection would that be more or less accurate than the haversine distance? My guess is that as long as I choose correct UTM zone, it should be very accurate right? – macintosh81 Aug 23 '17 at 03:59
  • @macintosh81 There are literally thousands of projections out there and there is no one projection suitable for all purposes, so you should be clear about how much accuracy you need, which features you need to preserve (e.g. distance, shape, area) and how big the area you are dealing with is. `UTM` is very good for measurement and preserves shape and directions, but the downside is that each zone only spans six degrees of longitude. If your area is limited (e.g. a city or region), then UTM is the way to go. Additionally, it is less computationally expensive than `haversine`. – Alz Aug 23 '17 at 22:19
  • Thanks so much! I am most interested in distance and my area is the whole country (India). However, I only calculate distances within cities in many different cities in India. Does that mean I need to use a city-specific UTM, or can I use just one UTM for all cities in India (as long as I don't compute across-city distances)? There seem to be so many UTMs for India so I get a little bit confused. – macintosh81 Aug 23 '17 at 23:32
  • You have two options: 1. (recommended) use @Daniel's solution to construct distance matrix using `scipy.spatial.distance.cdist` with your user-defined distance algorithm (haversine). and then my code. 2. You cannot use one UTM for the whole country. divide your data into as many UTM zones as necessary, project each one to its appropriate projection, do calculation for each. (too much work, little added value) haversine is reasonably accurate. if you don't need accuracy in the order of 1 meter, don't over-complicate things. – Alz Aug 24 '17 at 07:40

IIUC:

Source DFs:

In [160]: d1
Out[160]:
   id    lon    lat  varA
0   1  20.11  19.88   100
1   2  20.87  18.65    90
2   3  18.99  20.75   120

In [161]: d2
Out[161]:
  placeid    lon    lat
0       a  18.75  20.77
1       b  19.77  22.56
2       c  20.86  23.76
3       d  17.55  20.74

Vectorized haversine function:

import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
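
Because it is built entirely from numpy ufuncs, the function works on scalars and whole columns alike; for example, the distances from the first row of d1 to every row of d2 (the same numbers that appear in the dist column below):

haversine(19.88, 20.11, d2['lat'], d2['lon'])
#out: 0    172.924852
#     1    300.078600
#     2    438.324033
#     3    283.565975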

Solution:

x = d2.assign(x=1) \
      .merge(d1.loc[d1['id']==1, ['lat','lon']].assign(x=1),
             on='x', suffixes=['','2']) \
      .drop(columns='x')

x['dist'] = haversine(x.lat, x.lon, x.lat2, x.lon2)

yields:

In [163]: x
Out[163]:
  placeid    lon    lat   lat2   lon2        dist
0       a  18.75  20.77  19.88  20.11  172.924852
1       b  19.77  22.56  19.88  20.11  300.078600
2       c  20.86  23.76  19.88  20.11  438.324033
3       d  17.55  20.74  19.88  20.11  283.565975

filtering:

In [164]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[164]:
Empty DataFrame
Columns: [placeid, lon, lat, lat2, lon2, dist]
Index: []

Let's change d1 so a few rows satisfy the criteria:

In [171]: d1.loc[0, 'varA'] = 350

In [172]: d1
Out[172]:
   id    lon    lat  varA
0   1  20.11  19.88   350   # changed: 100 --> 350 
1   2  20.87  18.65    90
2   3  18.99  20.75   120

In [173]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[173]:
  placeid    lon    lat   lat2   lon2        dist
0       a  18.75  20.77  19.88  20.11  172.924852
1       b  19.77  22.56  19.88  20.11  300.078600
3       d  17.55  20.74  19.88  20.11  283.565975
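
To get a count for every id in one pass (the follow-up asked in the comment below), a sketch that builds the full cross join and groups, reusing the same d1, d2 and haversine:

xx = d2.assign(x=1) \
       .merge(d1.assign(x=1), on='x', suffixes=['', '2']) \
       .drop(columns='x')
xx['dist'] = haversine(xx.lat, xx.lon, xx.lat2, xx.lon2)
counts = xx.loc[xx.dist < xx.varA].groupby('id').size()  # one count per id
# (ids with no match drop out of size(); reindex on d1['id'] to restore them)

Note that the cross join materializes len(d1) * len(d2) rows, so for very large frames the cdist-based approaches scale better.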
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • Hey, thanks for the input. It's definitely helpful! However, the code does not seem to create a count for each id in dataset 1? Also, will this work fast for large datasets? – macintosh81 Aug 22 '17 at 02:56

Use scipy.spatial.distance.cdist with your user-defined distance algorithm as the metric:

import numpy as np
import scipy.spatial

# cdist hands the metric plain 1-D rows, so pass only the coordinate columns
# and index them positionally (u['lon'] would raise an IndexError here)
h = lambda u, v: haversine(u[0], u[1], v[0], v[1])
dist_mtx = scipy.spatial.distance.cdist(dataset1[['lon', 'lat']],
                                        dataset2[['lon', 'lat']], metric=h)

Then, to count how many places fall within each radius, just broadcast against it:

dataset2['count'] = np.sum(dataset1['varA'].values[:, None] > dist_mtx, axis=0)
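
This broadcasts varA down the columns, yielding one count per place in dataset2, mirroring the assignment in the question. If instead you want one count per place in dataset1 (as the prose of the question asks), sum along the other axis; a sketch:

dataset1['count'] = np.sum(dist_mtx < dataset1['varA'].values[:, None], axis=1)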
Daniel F
  • 13,620
  • 2
  • 29
  • 55
  • Hi, thanks for the input! I encountered IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices when using the spatial.distance method. Is there anything that I did wrong? – macintosh81 Aug 22 '17 at 15:59