4

I'm developing a web application that will need to query a very large population density database. The data is basically latitude, longitude, and population count (you could abstract this as lat, lon, pop).

I'm going to be using MySQL and PHP for this. The data is very granular and so the total number of points is very large — on the order of several billions. (I don't actually know how large it is at this point, to be honest; I don't yet have the complete data, just samples to play with.) The server is just a cheap server farm machine (e.g. Bluehost).

The application will be trying to aggregate population data for circles of arbitrary radius centered at specific latitude and longitude coordinates. So basically I'll be saying, "tell me the total population for a circle of x radius centered at lat, lon." Which will likely involve just writing a very simple summing function.

Given all of the above, and the desire for this to be as fast and efficient as possible, my question is: what's the best database schema? I've looked here and learned some nice basics regarding the storing of this kind of data (floats work fine for the lat/lon data, the BETWEEN statement is faster than < and > operators), but I'm wondering if, given the bulk of the data, there's a better way to do this than having a table of several billion rows and three fields (e.g. lat, lon, pop).

Possible ideas that occurred to me would be separating out all of the major longitude headings into separate tables, but I don't know if that would really speed things up much. (I don't know a lot about MySQL optimization other than indexing.) Another related idea is to store super long strings of hex data for various "chunks" of the data (e.g. specific lat x lon squares.) Another alternative is to essentially use big binary bitmaps and just decode them on the fly (this strikes me as impractical for my relatively cheap server to manage, however).

But I'm not a DBA, and not even really much of a programmer (I'm not a beginner, but I'm not a professional), so I'd love to hear any other suggestions for how to do this, and whether it's a nuts idea to begin with given the current server processing power.

Community
  • 1
  • 1
nucleon
  • 861
  • 6
  • 22
  • The important thing is how is the database indexed. However I would hire or sub-contract such a task to a professional programmer or DBA.... Or otherwise take many hours or days to learn more about MySQL and experiment it (on at least a desktop with a real-sized database e.g. of several dozens of megabytes). You probably want to express the latitude & longitude as integers (e.g. in minutes of latitude), or perhaps as normalized strings... – Basile Starynkevitch Mar 30 '13 at 18:39
  • 1
    study the rtree indexes, and learn about the [spatial extensions](http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html) of MySQL. – didierc Mar 30 '13 at 18:43
  • @BasileStarynkevitch: I have more time than money, so I'm happy to test and experiment, once I have some good ideas to test. – nucleon Mar 30 '13 at 21:04
  • @didierc: Is there any real speed advantage to using Spatial vs. Floats? Per [this answer](http://stackoverflow.com/questions/159255/what-is-the-ideal-data-type-to-use-when-storing-latitude-longitudes-in-a-mysql) there seem to be some real problems with the Spatial extensions, and since I'm not calculating distances with it I don't really see the advantage. – nucleon Mar 30 '13 at 21:04
  • 1
    Then you probably don't need it. What you need are probably rtrees for indexes, I think, as they group rows by "regions". That said, there is no builtin for testing against a radius, the rtree regions are boxes. In order to properly use them, you'll have to compute the min and max longitude and latitude from the center and radius, and use them to define the box surrounding the circle, and then refine the search by eliminating false positives. That's pretty much all I reckon. – didierc Mar 30 '13 at 22:13
  • Note that postgresql also supports spatial indexing. I never used either, so I cannot make any recommendation, just mentioning their existence for you to dig in further. – didierc Mar 30 '13 at 22:17

0 Answers0