43

I have a MySQL database. I store homes in the database and perform literally just 1 query against the database, but I need this query to be performed super fast, and that's to return all homes within a square box geo latitude & longitude.

SELECT * FROM homes 
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

How is the best way for me to store my geo data so that I can perform this query of displaying all home within the geolocation box the quickest?

Basically:

  • Am I using the best SQL statement to perform this query the quickest?
  • Does any other method exist, maybe not even using a database, for me to query the fastest way a result of homes within a boxed geolocation bounds?

In case it helps, I've include my database table schema below:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `address` varchar(128) collate utf8_unicode_ci NOT NULL,
  `city` varchar(64) collate utf8_unicode_ci NOT NULL,
  `state` varchar(2) collate utf8_unicode_ci NOT NULL,
  `zip` mediumint(8) unsigned NOT NULL,
  `price` mediumint(8) unsigned NOT NULL,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  PRIMARY KEY  (`home_id`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
) ENGINE=InnoDB  ;

UPDATE

I understand spatial will factor in the curvature of the earth but I'm most interested in returning geo data the FASTEST. Unless these spatial database packages somehow return data faster, please don't recommend spatial extensions. Thanks

UPDATE 2

Please note, no one below has truly answered the question. I'm really looking forward to any assistance I might receive. Thanks in advance.

HankW
  • 431
  • 1
  • 5
  • 6
  • UTM coordinates are a better choice - the world isn't flat, but UTM incorporates a degree of flattening while Lat/Long doesn't at all. – OMG Ponies Nov 28 '09 at 19:17
  • 1
    I also recommend reading about MySQL's spatial functionality: http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html – OMG Ponies Nov 28 '09 at 19:18
  • Postgres is another spatial capable db alternative, which I would recommend using rather than MySQL: http://www.postgresql.org/ – OMG Ponies Nov 28 '09 at 19:19
  • I understand spatial will factor in the curvature of the earth but I'm most interested in returning geo data the FASTEST – HankW Nov 28 '09 at 19:21
  • Even if the data is bad? – OMG Ponies Nov 28 '09 at 19:23
  • 1
    What do you mean, "data is bad"? My application is looking at data from typically no more than 3 miles by 3 miles wide. So the curvature of the earth does not factor in that much – HankW Nov 28 '09 at 19:26
  • Did you even look at the link that Ponies gave you for MYSQL Spatial? It is likely that spatial indexes will indeed help you fetch your data faster as well as be more correct. See http://dev.mysql.com/doc/refman/5.0/en/optimizing-spatial-analysis.html (which is a direct link from the page above) – Donnie Nov 28 '09 at 19:36
  • 1
    ALL, SPATIAL IS NOT FASTER. I am using InnoDB. Per the documentation "InnoDB tables do not support spatial data types before MySQL 5.0.16. As of 5.0.16, InnoDB supports spatial data types, but not indexes on them. " http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html Why do you keep recommending SPATIAL when the documentation you are providing me to read, and claim that I'm not reading but AM, clearly states that for InnoDB database - not having an index will case for slower queries. Again, my question in hand is HOW DO I PERFORM THE QUICKEST QUERY FOR GEO DATA ? – HankW Nov 28 '09 at 19:40
  • 1
    I don't want to sound rude but it's frustrating when you claim I'm not reading the documentation you link too and when I read it, it clearly state the OPPOSITE of what you claim. It makes me think, are you even reading the documentation you are linking too yourself? – HankW Nov 28 '09 at 19:42
  • 2
    UTM would be awkward unless the area of interest is less than approximately 6 degrees of longitude wide and preferably only one side of the equator. If the area is wider than this, you need to specify a zone and the coordinates will be discontinuous across zone boundaries. At the equator, the y coordinate approaches zero from the north, but 10000000 when approached from the south. For areas with large extent in both latitude and longitude, the easiest coordinate system is latitude and longitude. You just have to accept the issues that come with spherical coordinates. – Mark Thornton Nov 28 '09 at 20:27
  • HankW could you change your question topic to include your performance needs somehow to make your goal more explicit? of course "best" depends on the specific situation :) might also generate more views. – tosh Nov 28 '09 at 20:47
  • I have no idea whether this is faster than what you already do or not (I got here by accident), but Google Maps has an example of creating a shop locator that seems to do exactly what you are doing: http://code.google.com/apis/maps/articles/phpsqlsearch.html#findnearsql Maybe it helps. – Pekka Nov 30 '09 at 03:22

11 Answers11

14

There is a good paper on MySQL geolocation performance here.

EDIT Pretty sure this is using fixed radius. Also I am not 100% certain the algorithm for calculating distance is the most advanced (i.e. it'll "drill" through Earth).

What's significant is that the algorithm is cheap to give you a ball park limit on the number of rows to do proper distance search.


The algorithm pre-filters by taking candidates in a square around the source point, then calculating the distance in miles.

Pre-calculate this, or use a stored procedure as the source suggests:

# Pseudo code
# user_lon and user_lat are the source longitude and latitude
# radius is the radius where you want to search
lon_distance = radius / abs(cos(radians(user_lat))*69);
min_lon = user_lon - lon_distance;
max_lon = user_lon + lon_distance;
min_lat = user_lat - (radius / 69);
max_lat = user_lat + (radius / 69);
SELECT dest.*,
  3956 * 2 * ASIN(
    SQRT(
      POWER(
        SIN(
          (user_lat - dest.lat) * pi() / 180 / 2
        ), 2
      ) + COS(
        user_lat * pi() / 180
      ) * COS(
        dest.lat * pi() / 180
      ) * POWER(
        SIN(
          (user_lon - dest.lon) * pi() / 180 / 2
        ), 2
      )
    )
  ) as distance
FROM dest
WHERE 
  dest.lon between min_lon and max_lon AND
  dest.lat between min_lat and max_lat
HAVING distance < radius
ORDER BY distance
LIMIT 10
Sumurai8
  • 20,333
  • 11
  • 66
  • 100
Igor Zevaka
  • 74,528
  • 26
  • 112
  • 128
  • It appears that using the stored procedure on slide #14 is promising but it's unclear to me if that assumes a fixed radius. Do you know if the radius is fixed or not? I want to be able to pass in the box corner (radius) – HankW Nov 30 '09 at 03:36
  • I need to be able to pass in as an argument the boxed radius. Do you think I can then use the linked document as such then? – HankW Nov 30 '09 at 04:21
6

I had the same problem, and wrote a 3 part blogpost. This was faster than the geo index.

Intro, Benchmark, SQL

Evert
  • 93,428
  • 18
  • 118
  • 189
  • 1
    Evert, how do you implemented Morton (z-value)? You're second article just jumps in and says nothing about how you computed the value – HankW Nov 29 '09 at 04:07
  • 1
    The third one does actually. There's a stored procedure – Evert Nov 29 '09 at 13:34
  • What I don't understand is that when I perform the SELECT, how do I know what the morton value is to select on? – HankW Nov 29 '09 at 16:44
  • Good question. You should make sure that for every row in your table, you also store this morton value. You can do this with an AFTER INSERT (along with AFTER UPDATE). When you SELECT you can simply do a BETWEEN getGeoMorton(lat1,lng1) AND getGeoMorton(lat2,lng2). Because the morton select will be an approximation, and can include a lot of items outside the area, you must also add a standard where clause for just the latitude and longitude bounding box. The real trick is that you are now using a BTREE for much smaller areas rather than JUST the latitude or longitude. – Evert Dec 01 '09 at 09:14
  • And that's why SO answers should include relevant quotes... links are dead. – Stijn de Witt Jun 06 '16 at 20:37
  • Criticism still partly stands though; you should quote a few relevant sections from your post. At least mention Morton Number or something so that if/when those links die again, we can at least search for info. – Stijn de Witt Jun 07 '16 at 19:49
  • I don't disagree, but I also don't care enough. This answer is from '09. – Evert Jun 08 '16 at 01:19
2

If you really need to go for performance you can define bounding boxes for your data and map the pre-compute bounding boxes to your objects on insertion and use them later for queries.

If the resultsets are reasonably small you could still do accuracy corrections in the application logic (easier to scale horizontal than a database) while enabling to serve accurate results.

Take a look at Bret Slatkin's geobox.py which contains great documentation for the approach.

I would still recommend checking out PostgreSQL and PostGIS in comparison to MySQL if you intend to do more complex queries in the foreseeable future.

tosh
  • 5,222
  • 2
  • 28
  • 34
  • 1
    And this is exactly why we should not use links on StackOverflow. Your link is broken. –  Aug 05 '14 at 16:47
  • 1
    @Sandor thanks for letting me know, I've adapted the answer and removed the dead link. – tosh Aug 18 '14 at 18:25
1

A very good alternative is MongoDB with its Geospatial Indexing.

Darshana
  • 2,462
  • 6
  • 28
  • 54
jalogar
  • 1,604
  • 19
  • 27
1

Here's a trick I've used with some success is to create round-off regions. That is to say, if you have a location that's at 36.12345,-120.54321, and you want to group it with other locations which are within a half-mile (approximate) grid box, you can call its region 36.12x-120.54, and all other locations with the same round-off region will fall in the same box.

Obviously, that doesn't get you a clean radius, i.e. if the location you're looking at is closer to one edge than another. However, with this sort of a set-up, it's easy enough to calculate the eight boxes that surround your main location's box. To wit:

[36.13x-120.55][36.13x-120.54][36.13x-120.53]
[36.12x-120.55][36.12x-120.54][36.12x-120.53]
[36.11x-120.55][36.11x-120.54][36.11x-120.53]

Pull all the locations with matching round-off labels and then, once you've got them out of the database, you can do your distance calculations to determine which ones to use.

Ben
  • 19
  • 1
1

The indices you are using are indeed B-tree indices and support the BETWEEN keyword in your query. This means that the optimizer is able to use your indices to find the homes within your "box". It does however not mean that it will always use the indices. If you specify a range that contains too many "hits" the indices will not be used.

Peter Lindqvist
  • 10,122
  • 3
  • 41
  • 60
1

Sticking with your current approach there is one change you should make, Rather than indexing geolat and geolong separately you should have a composite index:

KEY `geolat_geolng` (`geolat`, `geolng`),

Currently your query will only be taking advantage of one of the two indexes.

Ben
  • 1,531
  • 1
  • 11
  • 9
1

Since MySQL 5.7 mysql can use geoindex like ST_Distance_Sphere() and ST_Contains() wich improve performances.

Anak1
  • 31
  • 5
0

This looks pretty fast. My only concern would be that it would use an index to get all the values within 3 miles of the latitude, then filter those for values within 3 miles of the longitude. If I understand how the underlying system works, you can only use one INDEX per table, so either the index on lat or long is worthless.

If you had a large amount of data, it might speed things up to give every 1x1 mile square a unique logical ID, and then make an additional restriction on the SELECT that (area="23234/34234" OR area="23235/34234" OR ... ) for all the squares around your point, then force the database to use that index rather than the lat and long. Then you'll only be filtering much less square miles of data.

Christopher Gutteridge
  • 4,425
  • 2
  • 21
  • 20
0

Homes? You probably won't even have ten thousand of them. Just use an in-memory index like STRTree.

novalis
  • 1,112
  • 6
  • 12
0

You might consider creating a separate table 'GeoLocations' that has a primary key of ('geolat','geolng') and has a column that holds the home_id if that particular geolocation happens to have a home. This should allow the optimizer to search for a range of geo locations that will be sorted on disk for a list of home_ids. You could then perform a join with your 'homes' table to find information about those home_ids.

CREATE TABLE IF NOT EXISTS `GeoLocations` (
`geolat` decimal(10,6) NOT NULL,
`geolng` decimal(10,6) NOT NULL,
`home_id` int(10) NULL
PRIMARY KEY  (`geolat`,`geolng`)
);

SELECT GL.home_id
FROM GeoLocations GL
INNER JOIN Homes H
 ON GL.home_id = H.home_id
WHERE GL.geolat between X and Y
 and GL.geolng between X and Y