1

There are these 2 tables. (A) contains locations defined by lat/long data. (B) is a city dictionary with the centroid of every city defined by lat/long data. The task is to find the nearest city (found in B) for every location (found in A).

Is there a way by which A can be updated at once, via, say, a computed JOIN (such as min absolute difference between A.lat and B.lat + A.long and B.long approach)?

For now I process table A row by row, but this is time consuming.

How would you go about it?

user3127882
  • 482
  • 5
  • 12
  • 1
    If you join, MySQL will do exactly that too: do the time consuming one row calculation for every row. Joining will not make it faster. You will have to optmiize your code for one row (e.g. using a [spacial index](http://stackoverflow.com/questions/36026887/find-nearest-points-with-mysql-from-points-table)). You still might want to do it in batches, as it still takes a while; and you can try to prefilter: e.g. locations that are close to Paris, France, will be closest to a city that is close to Paris, not to cities that are close to Washington D.C., so you can save some distance calculations. – Solarflare Feb 21 '17 at 03:21
  • 1
    One approach is to iteratively try to join on exact matches on lat/long using progressively rounded values. i.e. first attempt is to match locations with a city centroid using 6-decimal accuracy, then using rounded at 5 decimals, then at 4, etc. A test run on 2M+ locations runs at least 3 orders of magnitude faster than iteratively computing the closest match. Results differ but it is not clear which method is better. Matching on city centroids yields poor results when a location in on the periphery of a large city surrounded by small localities. – user3127882 Feb 21 '17 at 04:35
  • Yes, that would be one way to do the prefilter part. Lat/Long of Paris and Marseille will have closer rounded than lat/long of Marseille and Washington, DC. Just take care about how you round, 100.94° can be closer to 101.01° than to 100.86°, also 1° at the north pole will be a bigger distance in "km" than 1° at the equator. Still you should calculate the actual distance after prefiltering possible candidates with lan/long-similarity, which is basically using a rectangle around a city; since the expensive calculation is the distance, still use the gis-functions, as they are way faster. – Solarflare Feb 21 '17 at 14:15
  • 1
    hmmm... matching on rounded lat/long would not be prefiltering. Current data comes with a precision of 6 decimals (roughly 6 inches / 10 cms). So instead of computing (all) distances between locations and city centroids at absurd accuracies, one way is to find if there is an exact match at lower resolutions. Giving some more weight (i,e, lower resolution) to large cities appears to be improving the whole process. But the key difference is that exact matches at various resolutions in MUCH faster than computing every distances. (will eventually post a more complete answer). – user3127882 Feb 21 '17 at 15:40
  • 1
    If it works for you it's fine. Then you are basically "prefiltering" down to one possible candidate and just take it. But as soon as you get more than one candidate (which will happen at higher "resolutions", and which you may not see if you e.g. do something like `update ... join on roundfct() = roundfct()`, because this will simply take one of all possible candidates without complaining), you still need to calculate the exact distance (but just for these prefiltered candidates). That's why I called it prefiltering. But as I said, if it works, it's fine. – Solarflare Feb 21 '17 at 15:55
  • Would you mind posting a more complete answer with SQL code? I am solving exactly this problem. – fmalina Dec 19 '19 at 21:53
  • It's been a while and the code has been deleted once the job done. The recipe was running with a mix of SQL and python. As indicated above, the "trick" is to narrow the field of candidate cities by sub-selecting on less accurate lat/long values. IIRC, we truncated lat long at 2 decimals, then 1. Tried exact match (JOIN on truncated lat long). Removed matched and retried with "approximate" matches (i.e. +/- 1, 2, 3, etc.). If 2 or more candidate cities are found, compute distance and select closest centroid. Far from perfect, but fast. Did it for us. – user3127882 Dec 21 '19 at 05:12

0 Answers0