2

Correct me if I'm wrong.

There are three approaches to get the nearest homes, users have created in my website:

  1. To create a table with two columns(latitude, longitude) that both of them are float and say:

Here it is:

$latitude = 50;
$longitude = 60;

SELECT * FROM my_table
    WHERE (latitude  <= $latitude+10  AND latitude  >= $latitude-10)
      AND (longitude <= $longitude+10 AND longitude >= $longitude-10)

that 10 here means 1km for example.

In this approach we can also use harvesine formula.

  1. To merge those columns(latitude, longitude) to one column named point as POINT type and again search each row one by one.

  2. To categorize multiple points(the coordinates of homes users have created) as a category for one section of a country i.e. city and if a query comes with $latitude and $longitude to see the nearest homes, I will check in which category they are stored IN ORDER NOT TO search all rows but search only the section this query(coordinate) belongs to.

As I guess approach number 1 is slow because of the conditions for each row of table and again slow if I use harvesine formula.

If I use ST_Distance it seems again it's slow because again it just has lots of calculations.

But if I use approach number 3 it seems it is faster to check each section for an specific point user is than check all rows. I know how to set point for each home however I don't know how to create multiple home positions as a section maybe in another table.

BTW in new versions of MySQL and MariaDB Spatial Indexes are supported in InnoDB.

My questions:

  1. Is approach number 1 really slow or other ST_* functions are the same as this approach to check all rows with those formulas mentioned there one by one? Which one is faster?

  2. Does approach number 2 do something other than simple conditions to make it faster? I mean does it make any changes when using type of POINT instead of float and using ST_* functions instead of doing it myself? I want to know whether the algorithm is different.

  3. If approach number 3 is the fastest in these three approaches, how can I categorize points in order not to search all rows in a table?

  4. How can I use Spatial Indexes to make it as fast as possible?

  5. If any other approaches exist and I didn't mention, could you please tell me how can I get the nearest homes just by having coordinates in MySQL/MariaDB in PHP/Laravel?

Thanks All

Rick James
  • 135,179
  • 13
  • 127
  • 222
kodfire
  • 1,612
  • 3
  • 18
  • 57
  • Check the last part of my answer [here](https://stackoverflow.com/a/50668333/5563083) – Paul Spiegel Jul 19 '18 at 21:04
  • How many "homes" will you have in the dataset? There are a few billion in the world, but I doubt if there is data on most of them. – Rick James Jul 20 '18 at 01:00
  • @RickJames Maybe 2, 3 thousands right now but it is growing and will reach maybe to millions. – kodfire Jul 21 '18 at 05:15
  • @PaulSpiegel Your link good but again I got some questions here: 1. If I use only Spatial Index in my homes table, will this help me without third table of cities that have boundaries for each city to be searched? 2. You said MBRWithin or MBRContains, but what about st_contains, st_within and st_distance_sphere? 3. You said 'Increase the size of the polygon in a loop until it contains at least 5 locations.' but what if it doesn't find for hundred of times? Doesn't the loop have bad effect on performance? 4. Isn't it a good idea to search nearests by st_distance_sphere? Isn't it fast enough? – kodfire Jul 21 '18 at 06:17
  • @kodfire using only a st_distance function requires checking all of your "millions" of rows. _That_ is too costly. Growing a bounding box (non-Spatial) or a polygon (Spatial) will keep the effort in check, as I discuss in my answer. – Rick James Jul 21 '18 at 16:15
  • If the earth were flat then you could just get all points within the circle with your location as center and radius of 10km. It's probably not accurate to do this but it's probably a lot faster because it's easy to determine the range of longtitudes and latitudes you are interested in and with an index its practically logarithmic time. – apokryfos Jul 24 '18 at 15:59

2 Answers2

4

Which formula you use for the distance doesn't matter much. What matters much more is the number of rows which you have to read, process and sort. In best case you can use an index for a condition in the WHERE clause to limit the number of processed rows. You can try to categorize your locations - But it depends on the nature of your data, if that is going to work well. You would also need to find out which "category" to use. A more general solution would be to use a SPATIAL INDEX and the ST_Within() function.

Now let's run some tests..

In my DB (MySQL 5.7.18) I have the following table:

CREATE TABLE `cities` (
    `cityId` MEDIUMINT(9) UNSIGNED NOT NULL AUTO_INCREMENT,
    `country` CHAR(2) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `city` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `accentCity` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `region` CHAR(2) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
    `population` INT(10) UNSIGNED NULL DEFAULT NULL,
    `latitude` DECIMAL(10,7) NOT NULL,
    `longitude` DECIMAL(10,7) NOT NULL,
    `geoPoint` POINT NOT NULL,
    PRIMARY KEY (`cityId`),
    SPATIAL INDEX `geoPoint` (`geoPoint`)
) COLLATE='utf8mb4_unicode_ci' ENGINE=InnoDB

The data comes from Free World Cities Database and contains 3173958 (3.1M) rows.

Note that geoPoint is redundant and equal to POINT(longitude, latitude).

Concider the user is located somewhere in London

set @lon = 0.0;
set @lat = 51.5;

and you want to find the nearest location from the cities table.

A "trivial" query would be

select c.cityId, c.accentCity, st_distance_sphere(c.geoPoint, point(@lon, @lat)) as dist
from cities c
order by dist
limit 1

The result is

988204 Blackwall 1085.8212159861014

Execution time: ~ 4.970 sec

If you use the less complex function ST_Distance(), you get the same result with an execution time of ~ 4.580 sec - which is not so much difference.

Note that you don't need to store a geo point in the table. You can as good use (point(c.longitude, c.latitude) instead of c.geoPoint. To my surprise it is even faster (~3.6 sec for ST_Distance and ~4.0 sec for ST_Distance_Sphere). It might be even faster if I didn't have a geoPoint column at all. But that still doesn't matter much, since you don't want the user to wait so log for a respose, if you can do better.

Now let's look how we can use the SPATIAL INDEX with ST_Within().

You need to define a polygon which will contain the nearest location. A simple way is to use ST_Buffer() which will generate a polygon with 32 points and is nearly a circle*.

set @point = point(@lon, @lat);
set @radius = 0.1;
set @polygon = ST_Buffer(@point, @radius);

select c.cityId, c.accentCity, st_distance_sphere(c.geoPoint, point(@lon, @lat)) as dist
from cities c
where st_within(c.geoPoint, @polygon)
order by dist
limit 1

The result is the same. The execution time is ~ 0.000 sec (that's what my client (HeidiSQL) says).

* Note that the @radius is notated in degrees and thus the polygon will be more like an ellipse rather than a circle. But in my tests I always got the same result as with the simple and slow solution. I would though investigate more edge cases, before I use it in my production code.

Now you need to find the optimal radius for your application/data. If it's too small - you might get no results, or miss the nearest point. If it's too big - you might need to process too many rows.

Here some numbers for the given test case:

  • @radius = 0.001: No result
  • @radius = 0.01: exactly one location (kind of lucky) - Execution time ~ 0.000 sec
  • @radius = 0.1: 55 locations - Execution time ~ 0.000 sec
  • @radius = 1.0: 2183 locations - Execution time ~ 0.030 sec
Paul Spiegel
  • 30,925
  • 5
  • 44
  • 53
  • 1. What is '988204 Blackwall 1085.8212159861014'? 2. Could you tell me how index works? as I understand it is like array index that it doesn't calculate if some column is bigger than some value then show this BUT it gets the row directly from the point that we want. But when there are lots of data in a POINT type column, how does it get that directly while there is a need to calculate between I and the nearest points? 3. You have used SPATIAL INDEX for geoPoint but what if we have simple index for both `lat` ant `lon` columns? – kodfire Jul 24 '18 at 08:48
  • 4. By these results I got that just by using SPATIAL INDEX it got faster, right? 5. What if we don't use ST_BUFFER and use four points from mobile device(x1, y1, x2, y2) for polygon? – kodfire Jul 24 '18 at 08:49
  • 6. Lastly does this scenario fast enough: We have a table named `sections`. In this table we have `id`, `cat_id` and `geopoint`. When user submits his home at first we get in which section his home is, for example it is in a place you mentioned(London), and for example London is in 3rd category so `cat_id` is 3 so in `homes` table we have `id`, `cat_id` and `lat` and `lon` (as you mentioned it is faster than `geoPoint`) and when the user wants to find nearest homes, he presents his current point(lat, lon) and sql will firsly find in which category he is in `sections` table and then will fetch – kodfire Jul 24 '18 at 08:50
  • the homes from `homes` table by `cat_id` in order not to search all records. So I. Is this method fast enough? II. If yes which ones should get index attribute? and what changes should this have in order to be faster if possible? THANKS :) – kodfire Jul 24 '18 at 08:50
  • @kodfire It's the result of the query - ID and name of the nearest location (from my table) and the calculated distance. You can find *Blackwall* on [google maps](https://www.google.com/maps/place/Blackwall/@51.5076028,-0.0075512,18z). I can't explain how an R-Tree index works. Don't try to compare it with an array. Read the [wikipedia article](https://en.wikipedia.org/wiki/R-tree) to get an idea. I show you a way how to find the nearest location/POI quite efficiently using a SPATIAL INDEX. Answering all your questions would be far out scope of my answer. – Paul Spiegel Jul 24 '18 at 16:36
  • OK Thanks @PaulSpiegel – kodfire Jul 24 '18 at 19:32
  • 1
    @lmao asked about the distance unit [here](https://stackoverflow.com/q/54790435/5563083). [`ST_Distance_Sphere`](https://dev.mysql.com/doc/refman/5.7/en/spatial-convenience-functions.html#function_st-distance-sphere) returns a result in meters. [`ST_Distance`](https://dev.mysql.com/doc/refman/5.7/en/spatial-relation-functions-object-shapes.html#function_st-distance) returns a result with same units as used for the points. In case of lat and lon it's degree. See [demo](https://www.db-fiddle.com/f/pawWCBCMM8XbzrHWZ3s9vR/0). – Paul Spiegel Feb 20 '19 at 23:37
1

Bounding Box and Haversine

In your brief SELECT, you are using the "bounding box" approach, wherein a crude square is drawn on the map. It, however, has a couple of flaws.

  • the 50 and 60 are presumably in degrees; you say the 10 is in km. You can't mix them with out converting one or the other.
  • longitude degrees are shorter than latitude degrees; a cos() is needed to fix this.

Having these helps the bounding box, which filters the rows significantly, then the optional haversine test rounds the reach of the test.

INDEX(latitude)
INDEX(longitude)

This approach has "medium" performance -- One of the indexes will be used with the bounding box, thereby quickly limiting the candidates to an east-west (or north-south) stripe around the globe. But that may still be a lot of candidates.

By having filtered out most of the rows, the number of Haversine calls is not too bad; don't worry about the performance of the function.

If you have one million homes, the final bounding box that contains 5 homes (plus a few that fail the haversine check) will probably involve touching a few thousand rows -- due to using only one of the two indexes. This is still much better than fetching all million rows and check each one with the distance function.

POINT and SPATIAL index

Switching to POINT requires switching to a SPATIAL index. In this mode, ST_Distance_Sphere() is available instead of the haversine. (Caution: that function exists only in very recent versions.)

By having filtered out most of the rows, the number of calls to ST_Distance or ST_Distance_Sphere is not too bad; don't worry about the performance of the function.

SPATIAL searches use R-Trees. I do not have a good feel for their performance in your query.

Approach 3

By starting with another categorization of points, you add complexity. You also add the need to check adjacent regions to see if there are nearby points. I can't judge the relative performance without more details.

My Approach

I have some complex code that scales to arbitrarily many points. Since your dataset is probably small enough to be cached in RAM, it may be overkill for you. http://mysql.rjweb.org/doc.php/latlng

For only a million homes, the pair of indexes above might be "good enough" so that you don't need to resort to "my algorithm". My algorithm will touch only about 20 rows to get the desired 5 -- regardless of the total number of rows.

Other Notes

If you store both lat/lng and POINT, the table will be bulky; keep this in mind if trying to mix bounding boxes and ST functions.

Rick James
  • 135,179
  • 13
  • 127
  • 222