A user signs up for my site and enters their zip code. I want to query for other users and sort them by distance from that user.

I have a database full of zip codes with lat/lon points for each zip code.

zip_code (char)
lat (float)
lon (float)

I have a method that calculates the distance between two lat/lon pairs, but running it against every other zip code in my database on each query is expensive. Precomputing would mean running it on every zip code combination. I suppose I could do that once and store the results, but where would I store them? It seems strange to have a table for every zip code containing the distance to every other zip code. Is there a clean way to do this?

Ben174
  • You need a graph (stored in cache): http://networkx.github.io/ . Just initialize the edge weights with the distances and the nodes with the zip_code IDs. That way a query costs O(1) (node lookup) + O(|E|) (linear in the node's degree) – lucasg Jul 01 '13 at 14:29
  • I think the title of your question is misleading because that's not what you want to know how to do. – martineau Jul 01 '13 at 14:35
  • Are your users in the United States? Then sort the lat/lon based on lon. Then you will not need to test points where the lon's are 2 or more degrees away. – TreyA Jul 01 '13 at 14:59
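
TreyA's longitude-sorting suggestion could be sketched roughly as below. This is only an illustration: the function name `lon_candidates`, the row layout `(lon, lat, zip_code)`, and the sample rows are assumptions, not anything from the question. The 2-degree window comes from the comment and would need tuning to the search radius you actually want.

```python
from bisect import bisect_left, bisect_right

def lon_candidates(rows, lons, lon, max_delta=2.0):
    """Return the rows whose longitude is strictly within max_delta degrees of lon.

    rows: list of (lon, lat, zip_code) tuples sorted by lon (hypothetical layout).
    lons: the longitudes of rows, in the same order, precomputed once.
    """
    # Skip everything at or below lon - max_delta ...
    lo = bisect_right(lons, lon - max_delta)
    # ... and everything at or above lon + max_delta.
    hi = bisect_left(lons, lon + max_delta)
    return rows[lo:hi]

# Made-up sample data, sorted by longitude once up front.
rows = sorted([
    (-80.0, 35.0, "Z1"),
    (-75.5, 40.0, "Z2"),
    (-74.0, 40.7, "Z3"),
    (-73.0, 41.0, "Z4"),
    (-71.0, 42.0, "Z5"),
])
lons = [row[0] for row in rows]

nearby = lon_candidates(rows, lons, -74.0)
```

Only the surviving candidates need the exact distance calculation, so each query costs two binary searches plus one distance call per candidate.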

2 Answers

Doing it once and storing it somewhere sounds good to me. Here are some ideas that might give good performance with some consideration to storage space without sacrificing accuracy:

There are something like 43,191 zip codes, so the full distance matrix would have 43,191 × 43,191 = 1,865,462,481 entries. But the distances are of course symmetrical and the self-to-self ones are useless, which immediately cuts it down to 932,709,645 entries. We might also cut the space by realizing that a bunch of zip codes are either the same as each other, or one contains the other (e.g. 10178 seems to be inside 10016, and they're both geographically small). Many zip codes will have no users at all, so we might avoid populating those until they're needed (i.e. lazy-load the cache). And finally, you can probably throw away large-distance results, where "large" means a distance greater than is useful for your users.
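
As a rough sketch of the symmetric, lazily filled cache described above: the class name `DistanceCache`, the haversine formula (standing in for whatever distance method you already have), the Earth radius constant, and the sample coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

def pair_count(n):
    """Number of unordered pairs among n zip codes, excluding self-pairs."""
    return n * (n - 1) // 2

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

class DistanceCache:
    """Computes each unordered pair at most once, and only when first requested."""

    def __init__(self, coords):
        self.coords = coords  # {zip_code: (lat, lon)}
        self._cache = {}

    def distance(self, zip_a, zip_b):
        if zip_a == zip_b:
            return 0.0  # self-to-self entries are never stored
        # Order the key so (a, b) and (b, a) share one entry.
        key = (zip_a, zip_b) if zip_a < zip_b else (zip_b, zip_a)
        if key not in self._cache:
            self._cache[key] = haversine_miles(*self.coords[key[0]], *self.coords[key[1]])
        return self._cache[key]

# Made-up coordinates, one degree of latitude apart.
cache = DistanceCache({"A": (40.0, -74.0), "B": (41.0, -74.0)})
d_ab = cache.distance("A", "B")
d_ba = cache.distance("B", "A")
```

Here `pair_count(43191)` reproduces the 932,709,645 figure. A persistent version could use the same ordered-pair key as the primary key of a single table, and skip storing any pair whose distance exceeds your cutoff.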

For a more algorithmic view, see this previous question: Calculate distance between zip codes and users

Bonus tip: don't forget about non-US users. Poor non-US users.

John Zwinck

Here's a solution with a fair amount of overhead, but which will pay off as your dataset size, user base, and/or number of transactions grow:

If you don't already have one, use a database that supports spatial types and spatial indexing. I recommend the PostGIS extension for PostgreSQL, but most of these steps apply to other spatially enabled databases:

  1. Store your zip code location as a Point geometry type instead of two columns for lat and lon.
  2. Create a spatial index against the Point geometry column. Every time you add a new zip code, its location will automatically be added to the spatial index.
  3. Assuming you don't want to show "nearest" neighbors that are thousands of miles away, use a Within function (ST_DWithin in PostGIS) to filter out those zip codes that are too far away. This will significantly reduce the search space for close neighbors.
  4. Finally, use a Distance function (ST_Distance in PostGIS) to calculate the distance between your zip code of interest and the neighbors that survived the filter, and have the DB return the results sorted by distance.
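
The four steps above might look something like the following when driven from Python. Treat it as a sketch under assumptions: the table name `zip_codes`, the column `geog`, the index name, and the helper `nearby_zip_codes` are hypothetical, and it uses the PostGIS `geography` type (rather than plain geometry) so that ST_DWithin and ST_Distance work in meters. The cursor is assumed to be a DB-API cursor that accepts named `%(name)s` parameters, as psycopg2 does.

```python
# Steps 1 and 2: add a point column, fill it from lat/lon, and index it.
# Note that ST_MakePoint takes (x, y), i.e. (lon, lat).
SETUP_SQL = """
ALTER TABLE zip_codes ADD COLUMN geog geography(Point, 4326);
UPDATE zip_codes
   SET geog = ST_SetSRID(ST_MakePoint(lon, lat), 4326)::geography;
CREATE INDEX zip_codes_geog_idx ON zip_codes USING GIST (geog);
"""

# Steps 3 and 4: filter by radius with ST_DWithin, then sort by ST_Distance.
NEARBY_SQL = """
SELECT candidate.zip_code,
       ST_Distance(origin.geog, candidate.geog) AS meters
  FROM zip_codes AS origin
  JOIN zip_codes AS candidate
    ON ST_DWithin(origin.geog, candidate.geog, %(radius_m)s)
 WHERE origin.zip_code = %(zip_code)s
   AND candidate.zip_code <> origin.zip_code
 ORDER BY meters;
"""

def nearby_zip_codes(cursor, zip_code, radius_m=50_000):
    """Return (zip_code, meters) rows within radius_m of zip_code, nearest first."""
    cursor.execute(NEARBY_SQL, {"zip_code": zip_code, "radius_m": radius_m})
    return cursor.fetchall()
```

To sort users rather than zip codes, the same query could be joined against your users table on its zip code column. The 50 km default radius is arbitrary and only there to make the example concrete.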

By using a database with spatial index and a filtering function that uses that index, you can significantly speed up your search. And when the time comes to do more spatial analysis or show maps, you'll already have a framework in place to support that new functionality.

lreeder