5

I have a group of users. The user count could be 50 or could be 2000. Each should have a long/lat that I have retrieved from Google Geo api.

I need to query them all, and group them by proximity and a certain count. Say the count is 12 and I have 120 users in the group. I want to group people by how close they are (long/lat) to other people. So that I wind up with 10 groups of people who are close in proximity.

I currently have the google geo coding api setup and would prefer to use that.

TIA.

-- Update I have been googling about this for awhile and it appears that I am looking for a spatial query that returns groups by proximity.

TJ Sherrill
  • 2,465
  • 7
  • 50
  • 88

3 Answers3

6

Keep in mind that this problem grows exponentially with every user you add, as the amount of distance calculations is linked to the square of the number of users (it's actually N*(N-1) distances... so a 2000 user base would mean almost 4 million distance calculations on every pass. Just keep that in mind when sizing the resources you need

Are you looking to group them based on straight-line (actually great circle) distance or based on walking/driving distance?

If the former, the great circle distance can be approximated with simple math if you're able to tolerate a small margin of error and wish to assume the earth is a sphere. From GCMAP.com:

Earth's hypothetical shape is called the geoid and is approximated by an ellipsoid or an oblate sphereoid. A simpler model is to use a sphere, which is pretty close and makes the math MUCH easier. Assuming a sphere of radius 6371.2 km, convert longitude and latitude to radians (multiply by pi/180) and then use the following formula:

theta = lon2 - lon1
dist = acos(sin(lat1) × sin(lat2) + cos(lat1) × cos(lat2) × cos(theta))
if (dist < 0) dist = dist + pi
dist = dist × 6371.2

The resulting distance is in kilometers.

Now, if you need precise calculations and are willing to spend the CPU cycles needed for much complex math, you can use Vincenty's Formulae, which uses the WGS-84 reference ellipsoid model of the earth which is used for navigation, mapping and whatnot. More info HERE

As to the algorithm itself, you need to build a to-from matrix with the result of each calculation. Each row and column would represent each node. Two simplifications you may consider:

  1. Distance does not depend on direction of travel, so $dist[n][m] == $dist[m][n] (no need to calculate the whole matrix, just half of it)
  2. Distance from a node to itself is always 0, so no need to calculate it, but since you're intending to group by proximity, to avoid a user being grouped with itself, you may want to always force $dist[m][m] to an arbitrarily defined and abnormally large constant ($dist[m][m] = 22000 (miles) for instance. Will work as long as all your users are on the planet)

After making all the calculations, use an array sorting method to find the X closest nodes to each node and there you have it (you may or may not want to prevent a user being grouped on more than one group, but that's just business logic)

Actual code would be a little too much to provide at this time without seeing some of your progress first, but this is basically what you need to do algoritmically.

Javier Larroulet
  • 3,047
  • 3
  • 13
  • 30
  • thanks this is much closer to what I am looking for. And thanks for working inside the language I OP'd. I am 100% fine with a little inaccuracy. Any ways to simplify, lower the number of calculations is better. This is general grouping not precise grouping. Any feedback? – TJ Sherrill Sep 14 '18 at 17:50
  • Well it'll depend greatly on how "compact" are your user's positions, you might get creative and infer some cases in which not every node needs to be calculated. For instance, given 3 nodes A, B and C if you already know A and B are close together (`$dist[A][B]` is small) and B and C are very far apart (`$dist[B][C]` is huge) you may infer with a decent degree of certainty that A and C will be far apart too and thus skip that calculation. That's just an idea, the actual algorythm may need a lot of fine tuning to avoid false positives and false negatives – Javier Larroulet Sep 14 '18 at 18:36
  • oh and whatever you do, you'll need to do this in some sort of recursive way, otherwise the processing time will grow even faster than the amount of combinations that require calculation – Javier Larroulet Sep 14 '18 at 18:37
  • Ok this is helpful. Way over my head but helpful none the less. could I PM you somehow? – TJ Sherrill Sep 14 '18 at 21:21
  • 1
    Typically, you don't solve these problems with an n-squared order approach, There are clustering algorithms such as the one rf1234 provided. You can also, for example, reduce the problem by grouping them into a small number of subdividions. This gives you a min and max distances between any two points where the difference is no greater than twice the maximum dimension of your subdivision, and requiring no more than n + M-squared (where M is the number of your subdivisions) distance calculations. So more like 12000 (if you use a 10x10 grid). You make the initial decisions there and refine... – wordragon Sep 17 '18 at 05:17
  • I agree. The thing is that clustering algorithms would still require some calculations to be done in this case considering that we are trying to group nodes by the distance between each other and not by the distance from each node to a fixed point in space, which would make things a bit easier. There's some heuristics that could be applied (by defining a fixed arbitrary point in space to aid in clustering) but I think that should be for phase 2 of the OP's work. Baby steps :) – Javier Larroulet Sep 25 '18 at 17:59
5

... it appears that I am looking for a spatial query that returns groups by proximity. ...

You could use hdbscan. Your groups are actually clusters in hdbscan wording. You would need to work with min_cluster_size and min_samples to get your groups right.

https://hdbscan.readthedocs.io/en/latest/parameter_selection.html

https://hdbscan.readthedocs.io/en/latest/

It appears that hdbscan runs under Python.

Here are two links on how to call Python from PHP: Calling Python in PHP, Running a Python script from PHP

Here is some more information on which clustering algorithm to choose: http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb

http://scikit-learn.org/stable/modules/clustering.html#clustering

rf1234
  • 1,510
  • 12
  • 13
3

Use GeoHash algorithm[1]. There is a PHP implementation[2]. You may pre-calculate geohashes with different precision, store them in SQL database alongside lat-lon values and query using native GROUP BY.

  1. https://en.wikipedia.org/wiki/Geohash
  2. https://github.com/lvht/geohash
SergeAx
  • 542
  • 3
  • 14