6

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.

I would like to find the closest document to a specific set of coordinates but find the following code gives incorrect results:

from google.appengine.api import search

# coords are in the form of a tuple e.g. (50.123, 1.123)
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location' 
                            value=search.GeoPoint(coords[0], coords[1]))])

# find document function radius is in metres
def find_document(coords, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)

    search_query = search.Query(
        query_string='distance(location, geopoint(%.3f, %.3f)) < %d' \
                    % (coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))

    index = search.Index(name='document-index')
    return index.search(search_query)

With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.

I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is down to around 12km (radius=12000). There are generally no more than 1000 documents in a 12 km radius. (Probably associated with search.SortOptions(limit=1000).)

The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000 (12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?

Dan
  • 5,013
  • 5
  • 33
  • 59

3 Answers3

5

I believe the issue is the following. Your query will select up to 10K documents, then those are sorted according to your distance sort expression and returned. (That is, the sort is in fact not over all 400k documents.) So I suspect that some of the geographically closer points are not included in this 10k selection. That's why things work better when you narrow your search radius, as you have fewer total points in that radius.

Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on. You can address this in at least a couple of ways, which you can combine:

  • Add a ranking, so that the most 'important' docs (by some criteria that makes sense in your domain) are returned in rank order, then these will be sorted by distance.
  • Filter on one or more document field(s) (e.g., 'business category', if your docs contain information about businesses) to reduce the number of candidate docs.

(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added).

Amy U.
  • 2,227
  • 11
  • 11
  • Thanks for the confirmation of what @Middy and I presumed was happening behind the scenes. Knowing that the 'hit' limit is 10k definitely helps. Just for completion I'll state the answer to **I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?** is that I can't consistently if my search radius _might_ contain more than 10k documents - which unfortunately my app might do :( – Dan Mar 13 '13 at 21:00
1

I have the exact same problem, and I don't think its possible. The problem happens as you yourself has figured out when there are more possible results than returned results. The Google algorithm just quits when it has loaded the limits and then it sorts the results.

I have seen the same clusters as you and its part of the search API.

One Hack would be to subdivide your search into sub-sectors, do multiple simultaneous calls and then merge and order the results.

Middy
  • 91
  • 1
  • 9
  • From a pragmatic perspective the hack would work if you knew how dense your densest clusters of `GeoPoint`s were likely to be. (At the expense of Search API quota.) However if you don't know how dense your densest clusters are then sub-sectors could suffer with the same problem of giving wrong results without you knowing. In my case 20,000 Search API calls per day does not leave much room for doubling up calls for each user request. – Dan Mar 11 '13 at 22:56
0

Wild idea, why not keep/record the distance from 3 points then calculate from that.