Duplicate Records in MySQL Geo Search Using Cross Join and Hversine Formula

Question

I'm trying to complete a modification of this Google tutorial

I've written this SQL to query a table of locations using the location "name." Given the name of the location, the query returns pizza restaurants within the proximity. To accomplish this I've cross joined my table of restaurant locations, titled "markers" to itself and calculated distances using the Haversine formula.

    SELECT m.address,
       m.name,
       m.lat,
       m.lng,
       (3959 * ACOS(COS(RADIANS(poi.lat)) * 
       COS(RADIANS(m.lat)) * 
       COS(RADIANS(m.lng) - RADIANS(poi.lng)) + SIN(RADIANS(poi.lat))*
       SIN(RADIANS(m.lat)))) AS distance
    FROM markers poi
       CROSS JOIN markers m
    WHERE poi.address LIKE "%myrtle beach%"
          AND poi.id <> m.id HAVING distance < 200
   ORDER BY distance LIMIT 0,20

The query returns the expected results, but if the point of interest is outside the specified area, in this case "myrtle beach," I get duplicate records per match. This is because the CROSS JOIN and would be easy to fix with a DISTINCT select. But the "lng" and "lat" fields are FLOAT types, so the distance calculations are never identical, even for duplicated records.

Here is a subset of the returns:

3901 North Kings Highway Suite 1, Myrtle Beach, SC | East of Chicago Pizza Company | 33.716099 -78.855583 | 4.0285562196955125

1706 S Kings Hwy # A, Myrtle Beach, SC | Domino's Pizza: Myrtle Beach | 33.674881 | -78.905144 | 4.0285562196955125

82 Wentworth St, Charleston, SC | Andolinis Pizza | 2.782330 | -79.934235 | 85.68177495224947

82 Wentworth St, Charleston, SC | Andolinis Pizza | 32.782330 | -79.934235 | 89.71000040441085

114 Jungle Rd, Edisto Island, SC | Bucks Pizza of Edisto Beach Inc | 32.503971 -80.297951 | 114.22243529200529

114 Jungle Rd, Edisto Island, SC | Bucks Pizza of Edisto Beach Inc | 32.503971 -80.297951 | 118.2509427998286"

Any suggestions on where to go from here?

Can you post sample input & output showing the dups? And for the sake of readability, I suggest an edit that presumes the [existence of a `Haversine` function](http://stackoverflow.com/questions/6385452/a-custom-mysql-function-to-calculate-the-haversine-distance). — bishop, Feb 22 '14 at 19:59
@Brandon Buster Can you provide a sample of your data. Additionally what do you mean by point of interest? The only manually entered data I see is %myrtle beach% . What are you trying to make? As it is this question is a bit difficult to understand and needs clarifications and more details. — Menelaos, Feb 22 '14 at 20:00
Sorry I used the term "point of interest" in a couple of different contexts. The point of interest in this example is only Myrtle Beach. The data I want back is pizza restaurants within a 200 mile radius of that point. I'll edit my post with some sample data and more info — Brandon Buster, Feb 22 '14 at 22:28

score 1 · Answer 1 · answered Feb 22 '14 at 20:00

Try:

select distinct x.address, x.name, y.lat, y.lng, x.distance
  from (SELECT m.address,
               m.name,
               m.lat,
               m.lng,
               (3959 *
               ACOS(COS(RADIANS(poi.lat)) * COS(RADIANS(m.lat)) *
                     COS(RADIANS(m.lng) - RADIANS(poi.lng)) +
                     SIN(RADIANS(poi.lat)) * SIN(RADIANS(m.lat)))) AS distance
          FROM markers poi
         cross JOIN markers m
         WHERE poi.address LIKE "%myrtle beach%"
           and poi.id <> m.id HAVING distance < 200) x
  join markers y
    on x.address = y.address
   and x.name = y.name
   and x.lat = y.lat
   and x.lng = y.lng
 order by x.distance limit 0, 20

Thanks. Point for what looks to be be a working solution. I'd prefer to not perform a third table join though. If this turns out to be the only answer (which is likely) I'll mark this as the answer. — Brandon Buster, Feb 22 '14 at 22:40

score 1 · Answer 2 · answered Feb 22 '14 at 21:19

You are getting duplicate results because the two points are both matching "myrtle beach". Use a condition like poi.id < m.id to ensure you only get one match.

Example:

poi id    m id    distance
1         2       100
2         1       100

Query:

SELECT 
    m.address,
    m.name,
    m.lat,
    m.lng,
    (3959 * ACOS(COS(RADIANS(poi.lat)) * 
    COS(RADIANS(m.lat)) * 
    COS(RADIANS(m.lng) - RADIANS(poi.lng)) + SIN(RADIANS(poi.lat))*
    SIN(RADIANS(m.lat)))) AS distance
FROM markers poi
CROSS JOIN markers m
WHERE 
    (poi.address LIKE "%myrtle beach%" OR m.address LIKE "%myrtle beach%")
    AND poi.id < m.id 
HAVING distance < 200
ORDER BY distance LIMIT 0,20

Or, if you truly do have a singular row in markers as the point of interest, specify that instead of any match on address. Then your condition of poi.id <> m.id will ensure there are no duplicates.

SELECT 
    m.address,
    m.name,
    m.lat,
    m.lng,
    (3959 * ACOS(COS(RADIANS(poi.lat)) * 
    COS(RADIANS(m.lat)) * 
    COS(RADIANS(m.lng) - RADIANS(poi.lng)) + SIN(RADIANS(poi.lat))*
    SIN(RADIANS(m.lat)))) AS distance
FROM markers poi
CROSS JOIN markers m
WHERE 
    poi.id = (SELECT TOP(1) id FROM markers WHERE address LIKE "%myrtle beach%")
    AND poi.id <> m.id 
HAVING distance < 200
ORDER BY distance LIMIT 0,20

Thanks for the reply. It does eliminate the duplicates, though it also assumes all matching records will have a higher ID than any of the records found in my target area which will not always be the case. It is giving me a different perspective on how I wrote the SQL though. — Brandon Buster, Feb 22 '14 at 22:43
@BrandonBuster, it doesn't assume so, given that in the first query, either the higher or lower numbered row can match. In the second query, we don't use a `>` comparison. — Mitch, Feb 23 '14 at 00:49

score 0 · Answer 3 · answered Feb 23 '14 at 00:07

Reviewing everyone's responses got me thinking. Instead of asking why I was getting duplicate results, I started wondering which of the two Myrtle Beach locations was the query calculating distances from? The answer was BOTH. And this explains why I was getting two records per match in the first place.

Here's my solution:

SELECT  m.address, m.name, m.lat, m.lng, (3959 
   * ACOS(COS(RADIANS(poi.lat)) * COS(RADIANS(m.lat)) 
   * COS(RADIANS(m.lng) - RADIANS(poi.lng)) + SIN(RADIANS(poi.lat))
   * SIN(RADIANS(m.lat))))     AS distance
FROM markers m
cross JOIN (
   select  name, lat, lng from markers
   where address like '%myrtle beach %'
   limit 1
) poi
HAVING distance < 200
ORDER BY name
LIMIT 0, 20

This doesn't give me the most accurate Distance calculations, as it arbitrarily uses the first restaurant it finds as the epicenter. But for my immediate purposes, this is good enough. I think for this this app to be production ready, I would need a second table for cities which would contain coordinates for the city center.

Duplicate Records in MySQL Geo Search Using Cross Join and Hversine Formula

3 Answers3