Holding tree data in SQL

Question

I am aiming to store locations in a table but as well as storing one location, I also want to link it to another.

Example: With a value of "Palm Beach" I also want to link this to "Florida" and "USA" to make the address:

Palm Beach, Florida, USA

Idea Number 1 One table to hold either a contry, city, town etc, but have it link to "parents".

CREATE TABLE location {
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parentid          
FOREIGN KEY parentid REFERENCES Employee (EmployeeID),
name VARCHAR (100) NOT NULL UNIQUE,
loc VARCHAR (17) NOT NULL,
rad VARCHAR (17),
added DATETIME DEFAULT NOT NULL
};

Idea Number 2: The same system but with separate tables for the levels

CREATE TABLE locality (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR (100) NOT NULL UNIQUE,
loc VARCHAR (17) NOT NULL,
rad VARCHAR (17),
added DATETIME DEFAULT NOT NULL,
FOREIGN KEY id REFERENCES administrative_area_level_1 (administrative_area_level_1),
};

CREATE TABLE administrative_area_level_1 (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR (100) NOT NULL UNIQUE,
loc VARCHAR (17) NOT NULL,
rad VARCHAR (17),
added DATETIME DEFAULT NOT NULL,
FOREIGN KEY id REFERENCES administrative_area_level_2 (administrative_area_level_2),

);

CREATE TABLE administrative_area_level_2 (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR (100) NOT NULL UNIQUE,
loc VARCHAR (17) NOT NULL,
rad VARCHAR (17),
added DATETIME DEFAULT NOT NULL
);

I really want to be able to get out the entire structure from one starting point, for example for a search of "Palm Beach" I want to be able to retrieve that it is from the USA.

Can anyone give me some input on the best approach to this please?

EDIT: I think this is the structure I want to use: https://stackoverflow.com/a/317536/1738522

This sounds like a use-case for a graph-based database like Neo4j. — Philipp, Aug 14 '13 at 08:20
NoSQL is going to cause me some headaches really, since this is setup on MYSQL Amazon RDS instance — Jimmy, Aug 14 '13 at 08:23
Are you using Postgres or MySQL? Because Postgres can easily handle the first model due to its ability to run recursive queries. — , Aug 14 '13 at 08:45
Fixed, sorry about that, I wasnt sure if RDS did postgresql when I asked it — Jimmy, Aug 14 '13 at 09:17

score 1 · Accepted Answer · answered Aug 14 '13 at 08:44

1

The short answer to this is that I've addressed this exact issue (geographical regions arranged in a hierarchy) using a Nested Set model ( More info available at http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/ )

The longer answer is that you should not create a separate table for each level as that is not extendable to "n" levels easily and there is no simple route from level n to level 1. For this reason, idea #1 is the one you should pursue and elaborate on.

If ever in doubt, just sit back and look at how much you are duplicating a table structure. If you have totally duplicated a table structure then they should not (usually) be separate tables, as the duplication indicates you are storing the same data. Instead work out how to store it in a single table.

Within my "regions" table I have the following columns (among others)

regionID
parentRegionID
regionName
lft
rgt

The regionID and parentRegionID are populate easily enough. I then use a stored procedure to generate lft and rgt values which are used as detailed within the article I linked.

The benefit of storing the direct parentID is that you can more easily manipulate the tree. You don't need to do this though, and could easily just store lft and rgt using procedures to add/move/remove tree nodes

The use of lft/rgt allows easy traversal/retrieval of parents/children all the way up and down the tree without having n joins onto itself.

answered Aug 14 '13 at 08:44

Simon at The Access Group

6,932
22
45

Hi, thank you for the reply. Based on your information I created this, could you please check it? http://sqlfiddle.com/#!2/d28f6 – Jimmy Aug 14 '13 at 09:00
Furthermore, what happens when I am at the top of the heiracy or the bottom. If lft or rgt are non null then wont I run to an issue at either end? – Jimmy Aug 14 '13 at 09:01
The lft and rgt nodes are not needed, and actually dangerous, since they impose a non-existing order on the siblings. The fundamental way to identify siblings is "having the same parent". They are even worse in the nested set model, since their value depends on other nodes, causing major renumberings on inserts or deletes. – wildplasser Aug 14 '13 at 09:16
The lft and rgt have a purpose in this specific use in that it allows simple determination of, for example, airports within a geographic area. The parentID defines relationships, and the lft/rgt are used for reference/lookup to allow more efficient traversal of the geographic tree. Have the parentID as NULLable, and then the procedure that assigns lft/rgt use regions with NULL parents as the top of branches within your tree. As @wildplasser says, this method will result in many renumberings on inserts/deletes, but is aimed at a relatively static tree used for reading more than writing. – Simon at The Access Group Aug 14 '13 at 10:28
Here's a tweaked SQL fiddle http://sqlfiddle.com/#!2/ab0bc/1 - you would then need a procedure which will then index this, which will populate the lft/rgt/depth values within this. As mentioned, this method WILL get slow when it comes to reindexing, however when it comes to actually reading the data it will be far more usable than relying on recursive joins parent-child – Simon at The Access Group Aug 14 '13 at 10:33
It might lead to wet dreams for ORM people, for data modellers it is closer to a nightmare. IMnsvHO. BTW: it resmbles storing the tabbing-order of fields (which indeed can be needed at times) – wildplasser Aug 14 '13 at 11:46
@wildplasser what is your suggestion then? You've spoken out against mine but not put forward one. If there's a better way to handle this data that works for both reading + writing efficiently then I'd love to hear it, as I've been seeking better alternatives for a long while and so far this incarnation is the most efficient I've found for accessing and working with geographic areas with loosely defined levels i.e. continent/country group/country/state/county/resort/city/town/street/etcetcetc – Simon at The Access Group Aug 14 '13 at 11:57
Your solution might be appropriate in your particular case (which I don't know), but as a general solution it is ugly and violates 1NF. As a first, step I would propose separating the objects and the topology. The topology still is derived(+redundant) data (probably based on range-searches on lat/long inside the (sub)areas.) – wildplasser Aug 14 '13 at 12:07
The only area I see it violating 1NF is in having NULLable fields, wherein a NULL value is a valid value for a given row relating to a lack of parent. None of the data in the suggested layout (aside from perhaps a depth indicator) is redundant, though in my own implementation the hierarchy and data points are 2 seperate tables due to re-use of some data-points within the hierarchy. Without being able to define an area boundary clearly, lat/lng will not allow a hierarchy of geographic areas to be reliably returned. OP really needs to clarify their intentions to allow better advice though. – Simon at The Access Group Aug 14 '13 at 12:24

score 0 · Answer 2 · answered Aug 14 '13 at 09:16

0

People have been storing addresses in large, flat tables for a very long time for good reason -- they are very messy to deal with, and the theoretical hierarchical structure you seek is not achievable with real world data.

For example, there might be one city of "Palm Beach", making it easy to associate with a State, but how many cities are there named "London", "Oxford", "Springfield", etc? A great many, and of course in different states. So knowing that a city is named "Oxford" is not enough to identify it -- it is only unique in the context of a State.

Zip code boundaries are another problem -- not all five digit zip codes lie entirely within a single state -- so you cannot define a zip code as a child of a single state.

And that's not to mention internationalisation issues.

You're trying to invent a new solution for a problem that has already been solved, I'm afraid.

answered Aug 14 '13 at 09:16

David Aldridge

51,479
8
68
96

Google geocoding will give me what category the value is, for example that USA is a country and Washington is a state, effectively allowing me to have a Washington state and a washington city. Does this make my approach achievable or is it still flawed. Furthermore, is the large flat tables the "solution" you talk about? – Jimmy Aug 14 '13 at 09:19
And what about the dozens of "London"s in the World? Or the thousands (millions?) of "High St" or "Main St"s? How exactly are you planning to search on a name and get only a single match? – Richard Huxton Aug 14 '13 at 09:42
I was planning to use the most significant one and possibly constrain by country. For example if London appears as both a state and a locality I would use the state one since it is seen to be more significant – Jimmy Aug 14 '13 at 09:47
1

Yes, but which city called "Washington"? A city named "Washington" only has a unique identity in the context of it's state. Now if you are going to have a table of cities then that alone could be a hierarchy level below State, but zip codes cannot be in the same parent-child hierarchy -- they cannot even be a child of a single state. This is why addresses are most commonly stored as flat tables in information systems except for very specific geo-related purposes. If you were storing customer addresses for example, you would almost certainly not use a hierarchy of this type. – David Aldridge Aug 14 '13 at 09:49
Right I see. That is a valid comment. Are you suggesting a schema like this then? https://gist.github.com/employ/9c867d12c7aa4648874f – Jimmy Aug 14 '13 at 09:51
@Jimmy wouldn't you then be using the wrong one sometimes? Not sure what your system's purpose is, but I can think of quite a few situations where looking for the correct location in this way would be pretty disastrous. – David Aldridge Aug 14 '13 at 09:52
@Jimmy depending on your intent, then quite possibly. If you wanted to store correct addresses for customers in the US in particular then I'd look at the format that the US Postal Service uses and just copy that. – David Aldridge Aug 14 '13 at 09:54
I don't get your comment about "using the wrong one", I'm assuming thats with my old system having the wrong washington? I guess I could separate out the countries into a different table at least, since there shouldnt be more than one country with the same name? – Jimmy Aug 14 '13 at 09:57
I think that there's a valid case for a separate table for countries, as they have particular attributes that you might want to store against them. Linguistics, currencies, address formats even! By "wrong one" I mean that if you had a location "Washington" and needed to identify which of many Washingtons it referred to, then you could chose the wrong one if you did not also know whether it was a country, city, state, erm ... lake etc., and also what its parent is. – David Aldridge Aug 14 '13 at 10:07
Countries, cities etc, they are all geographic regions @DavidAldridge while it may be worth having a field that allows classification of regions into these artificial labels, ultimately it is a hierarchy. Sometimes an address will miss out segments of that hierarchy of course, but a whole Country, for example, will never be a child of a Street. When it comes to countries, a good reference would be the ISO codes as that at least gives you a reference point which is understood in other systems than your own. Personally I store ISO codes against each region to aid classification/organisation – Simon at The Access Group Aug 14 '13 at 10:37
@Simonatmso.net my point is not whether there is a hierarchy, or whether geographic entities can be classified by levels in a hierarchy, but whether in practice you can take an entity such as "a city named washington" and uniquely identify it without reference to its parent or some hypothetical standard code. Countries, US states, they have unique codes, but towns and cities generally do not unless you use some absolute geographic location (lat/long). Is it possible? Yes. Is it worth it? Only for very specialised systems -- most do not fall in that category. – David Aldridge Aug 14 '13 at 12:03
@DavidAldridge in fairness, more information as to OP's intention with this data is probably needed to advise appropriately – Simon at The Access Group Aug 14 '13 at 12:17

Holding tree data in SQL

2 Answers2