0

I have a database of information organized by US zipcode. I am building algorithms that crawls along adjacent zipcodes to determine the size of a 'city' based on density, job characteristics, or whatever. I use the location and area of any zip code to estimate which other zipcodes are adjacent to it. I am becoming aware that this algorithm is eating up most of my processing time when I run test of my program.

So what I want to do is have the map (as in a data structure map) of adjacent zipcodes in a table in my database.

My current implementation is that I have a table with two fields, source and target. Each time my algorithm determines two zipcodes are adjacent, the two codes are inserted both way into the table, as so:

+-----------+------------+  
| source    |  target    |
+-----------+------------+
| 02139     |  02138     |
| 02138     |  02139     |
+-----------+------------+

That way I can search for all adjacent zip codes with

SELECT target FROM adjacent WHERE source = '02139';

and get all the zip codes that are adjacent to '02139'.

Now strictly speaking, my implementation is just fine. For a set of less than 50,000 total zip codes, doing it the way I did doesn't really impose any computational penalties. However, not being indexed, and having each relationship inserted twice seems non-optimal, and since I'm just doing this for funzies and learning, I should put in the effort to optimize. So I'm trying to find out how to more efficiently simulate a mapping using a mysql table.

So the question is: what is the most efficient way to represent 1-to-n mappings using MySQL?

kingledion
  • 2,263
  • 3
  • 25
  • 39
  • This is **not** a 1:n mapping, this is a graph. – Shadow Jan 14 '17 at 18:45
  • @Shadow Well, a graph can be decomposed into a series of 1:n mappings of nodes to adjacent nodes. I suppose the question could be, "What is the best way to store a graph in MySQL" and that would also give me what I want to know. – kingledion Jan 14 '17 at 20:18
  • A directional graph can indeed be decompised into a series of 1:n relationship of nodes. However, your graph is unidirectional, where a 1:n relationship is an oversimplification of the actual relationship. You pay the price of the oversimplification by having a more complex query. In case of an unidirectional graph you need to ask yourself if you want to optimise for storage and maintenance or you would like to optimise for the speed of traversing the graph. This is not a right or wrong question. The optimal solution can only be determined in the actual application. – Shadow Jan 14 '17 at 20:36
  • Regarding closure as a duplicate, I view this from the perspective of the google search-er. If you are searching for a 1:n mapping, how are you to know that you are actually conceiving of a graph unless there are posts that tell you so? – kingledion Jan 14 '17 at 22:16
  • Again, this question has nothing to do with 1:n mapping. That's a simple parent - child relationship. And the duplicate closure link would connect the 2 questions anyway. The question and the answers are the same for both topics, so this one does not add any value to be honest to this site. – Shadow Jan 14 '17 at 22:27

1 Answers1

3

In your application, the concept of adjacency seems to be bidirectional (aka symmetric). That is,

A adj B if and only if B adj A

So you can consider "canonicalizing" the relationship and then always store the zip with a smaller numerical value in the first column and the one with the larger numerical value in the second column. That is, using your example, you now only have one row:

+-----------+------------+  
| zipLower  |  zipHigher |
+-----------+------------+
| 02138     |  02139     |
+-----------+------------+

And then when you need to find all the neighboring zip of, say, 02139, your query may look like this (assuming the new table is called adjHigher):

SELECT zipHigher as zip
FROM adjHigher 
WHERE zipLower = '02139'
union
SELECT zipLower as zip 
FROM adjHigher 
WHERE zipHigher = '02139'

Pros and cons

Is this really a more optimal design? It depends. This design uses half of the storage space, and insertion into the table may be more efficient (only one row, not two rows, per adjacent relationship is needed to be insert). However, you can also see that the lookup query becomes more complicated. If you have to JOIN this table with other tables, this design may make your JOINs more complicated.

I guess the intention of this discussion is to explore different design options before committing to one. So here it is.

leeyuiwah
  • 6,562
  • 8
  • 41
  • 71
  • 1
    If I do that, what command can I then use to find all adjacent zips to '02139'? Since now my 'source' zip could be in either of the fields. – kingledion Jan 14 '17 at 18:37
  • 1
    Use a view with a select like `select ziplower source, ziphigher target from neighbors union all select ziphigher, ziplower from neighbors`. – Carsten Massmann Jan 14 '17 at 18:49
  • 1
    @kingledion -- I have updated my question above to address your question. – leeyuiwah Jan 14 '17 at 18:55