51

Of course I realize that there's no one "right way" to design a SQL database, but I wanted to get some opinions on what is better or worse in my particular scenario.

Currently, I'm designing an order entry module (Windows .NET 4.0 application with SQL Server 2008) and I'm torn between two design decisions when it comes to data that can be applied in more than one spot. In this question I'll refer specifically to Addresses.

Addresses can be used by a variety of objects (orders, customers, employees, shipments, etc..) and they almost always contain the same data (Address1/2/3, City, State, Postal Code, Country, etc). I was originally going to include each of these fields as a column in each of the related tables (e.g. Orders will contain Address1/2/3, City, State, etc.. and Customers will also contain this same column layout). But a part of me wants to apply DRY/Normalization principles to this scenario, i.e. have a table called "Addresses" which is referenced via Foreign Key in the appropriate table.

CREATE TABLE DB.dbo.Addresses
    (
        Id          INT
                    NOT NULL
                    IDENTITY(1, 1)
                    PRIMARY KEY
                    CHECK (Id > 0),

        Address1    VARCHAR(120)
                                NOT NULL,

        Address2    VARCHAR(120),

        Address3    VARCHAR(120),

        City        VARCHAR(100)
                    NOT NULL,

        State       CHAR(2)
                    NOT NULL,

        Country     CHAR(2)
                    NOT NULL,

        PostalCode  VARCHAR(16)
                    NOT NULL
    )

CREATE TABLE DB.dbo.Orders
    (
        Id          INT
                    NOT NULL
                    IDENTITY(1000, 1)
                    PRIMARY KEY
                    CHECK (Id > 1000),

        Address     INT
                    CONSTRAINT fk_Orders_Address
                    FOREIGN KEY REFERENCES Addresses(Id)
                    CHECK (Address > 0)
                    NOT NULL,

        -- other columns....
    )

CREATE TABLE DB.dbo.Customers
    (
        Id          INT
                    NOT NULL
                    IDENTITY(1000, 1)
                    PRIMARY KEY
                    CHECK (Id > 1000),

        Address     INT
                    CONSTRAINT fk_Customers_Address
                    FOREIGN KEY REFERENCES Addresses(Id)
                    CHECK (Address > 0)
                    NOT NULL,

        -- other columns....
    )

From a design standpoint I like this approach because it creates a standard address format that is easily changeable, i.e. if I ever needed to add Address4 I would just add it in one place rather than to every table. However, I can see the number of JOINs required to build queries might get a little insane.

I guess I'm just wondering if any enterprise-level SQL architects out there have ever used this approach successfully, or if the number of JOINs that this creates would create a performance issue?

Tiago Martins Peres
  • 14,289
  • 18
  • 86
  • 145
Vince Fedorchak
  • 1,145
  • 4
  • 11
  • 19
  • There are definitely related Q&A in SO - now hunting them. – Jonathan Leffler Oct 03 '11 at 19:33
  • 19
    You might find any of these helpful: [297855](http://stackoverflow.com/questions/297855), [310540](http://stackoverflow.com/questions/310540), [373084](http://stackoverflow.com/questions/373084), [929684](http://stackoverflow.com/questions/929684), [3094126](http://stackoverflow.com/questions/3094126), [4840928](http://stackoverflow.com/questions/4840928). – Jonathan Leffler Oct 03 '11 at 20:45
  • @Jonathan: All very good posts! I wish you could +1 on comments. :) – Vince Fedorchak Oct 03 '11 at 20:54
  • Such a nice convention, +1 for this! – Alix Axel Oct 31 '11 at 09:53
  • 2
    Outside of America and Canada, states can be three letter abbreviations or more - for instance "NSW" for "New South Wales" in Australia, or indeed ALL of the counties in Britain are three letters (AVN- "Avon", MCH - "Manchester") – binderbound Apr 13 '17 at 07:30
  • 1
    I know this is old, but I want to add a quick point. You may find storing an address as a FK in some tables results in behavior you don't want. Yes, if a customer address changes you only need to change it in one place, but there are some locations that you likely don't want updated. One of those being order history. If a customer places an order, receives it, is happy, then moves and updates their address, the past order address should not change. It was already shipped and delivered, changing its address means it is now wrong. It also means you lose the actual address it shipped to. – Air May 20 '20 at 16:07

9 Answers9

40

You're on the right track by breaking address out into its own table. I'd add a couple of additional suggestions.

  1. Consider taking the Address FK columns out of the Customers/Orders tables and creating junction tables instead. In other words, treat Customers/Addresses and Orders/Addresses as many-to-many relationships in your design now so you can easily support multiple addresses in the future. Yes, this means introducing more tables and joins, but the flexibility you gain is well worth the effort.

  2. Consider creating lookup tables for city, state and country entities. The city/state/country columns of the address table then consist of FKs pointing to these lookup tables. This allows you to guarantee consistent spellings across all addresses and gives you a place to store additional metadata (e.g., city population) if needed in the future.

Joe Stefanelli
  • 132,803
  • 19
  • 237
  • 235
22

I just have some cautions. For each of these, there's more than one way to fix the problem.

First, normalization doesn't mean "replace text with an id number".

Second, you don't have a key. I know, you have a column declared "PRIMARY KEY", but that's not enough.

insert into Addresses 
  (Address1, Address2, Address3, City, State, Country, PostalCode)
values
  ('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
  ('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
  ('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
  ('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500');

select * from Addresses;

1;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
2;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
3;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
4;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500

In the absence of any other constraints, your "primary key" identifies a row; it doesn't identify an address. Identifying a row is usually not good enough.

Third, "Address1", "Address2", and "Address3" aren't attributes of addresses. They're attributes of mailing labels. (Lines on a mailing label.) That distinction might not be important to you. It's really important to me.

Fourth, addresses have a lifetime. Between birth and death, they sometimes change. They change when streets get re-routed, buildings get divided, buildings get undivided, and sometimes (I'm pretty sure) when a city employee has a pint too many. Natural disasters can eliminate whole communities. Sometimes buildings get renumbered. In our database, which is tiny compared to most, about 1% per year change like that.

When an address dies, you have to do two things.

  • Make sure nobody uses that address to mail, ship, or whatever.
  • Make sure its death doesn't affect historical data.

When an address itself changes, you have to do two things.

  • Some data must reflect that change. Make sure it does.
  • Some data must not reflect that change. Make sure it doesn't.

Fifth, DRY doesn't apply to foreign keys. Their whole purpose is to be repeated. The only question is how wide a key? An id number is narrow, but requires a join. (10 id numbers might require 10 joins.) An address is wide, but requires no joins. (I'm talking here about a proper address, not a mailing label.)

That's all I can think of off the top of my head.

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
  • 1
    Thank you for the tips. You say that identifying a row is not good enough but then you don't provide an alternate solution. Do you mean that my primary key should be made from a combination of columns in the table rather than an IDENTITY INT? How then would I refer to a particular address in my foreign keys? – Vince Fedorchak Oct 04 '11 at 12:29
  • Also, you say that DRY doesn't apply to foreign keys. But in this case it does because I am "not repeating" the address columns across multiple tables and instead breaking them out to their own table. This is an almost textbook definition of DRY - making sure that if a change needs to be made, it is made in one place, which was my whole purpose in creating the table this way. – Vince Fedorchak Oct 04 '11 at 12:33
  • 2
    You *are* repeating the foreign key; you're just repeating an integer instead of text. When a foreign key is meaningful text, `ON UPDATE CASCADE` lets you update the value in a single place, unless your dbms doesn't support that feature. (Oracle doesn't.) But when you're dealing with addresses, you usually need to take extra care to *avoid* updating some kinds of historical data, like the addresses on bills and invoices. And that's true regardless of whether your foreign key is text or an integer. – Mike Sherrill 'Cat Recall' Oct 04 '11 at 13:11
  • @Catcall: Very, very insightful, +1. – Alix Axel Oct 10 '11 at 13:07
  • 1
    And remember, you can still use the surrogate key if you place a unique index on the natural key. This gets you teh best of both worlds as you can;t put in duplicate data, but you have something smaller to join on and you don't have to change a million child records because the the zipcode got changed. – HLGEM Oct 02 '13 at 14:33
  • So what's the solution to uniquely identify a logical address record? Should we do some sort of HashKey that is a combination of `AddressLine1`, `City`, `State` and `Zip`, so code can identify that the incoming address already exists with PK / `Id = 1` (in your example above)? – Shiva Jan 09 '16 at 01:31
  • @Shiva: One way is uniquely identify addresses is to combine authoritative forms (we used USPS databases to look up the "correct" form) with a unique index on the address columns. Code should usually just insert, and handle the "duplicate key" error; you have to handle errors anyway. – Mike Sherrill 'Cat Recall' Nov 16 '18 at 12:21
  • 1
    Except in very rare cases, it's a Bad Idea(tm) to create generate a surrogate key without a corresponding natural key. If you don't updates get very difficult and, unless pains are taken, many duplicate entries are created (as indicated by @mike-sherrill-cat-recall above). Nor is the join overhead of many surrogate keys and many-to-many intersection tables negligible. I'm not saying don't create an address table. I'm saying think through the entire life cycle of the data to determine if it really works for you. – Charlie Reitzel Aug 07 '19 at 23:12
12

You would want the addresses to be in a separate table only if they were entities in their own right. Entities have identity (meaning it matters if two objects pointed to the same address or to different ones), and they have their own lifecycle apart from other entities. If this was the case with your domain, I think it would be totally apparent and you wouldn't have a need to ask this question.

Cade's answer explains the mutability of addresses, something like a shipping address is part of an order and shouldn't be able to change out from under the order it belongs to. This shows that the shipping address doesn't have its own lifecycle. Handling it as if it was a separate entity can only lead to more opportunities for error.

"Normalization" specifically refers to removing redundancies from data so you don't have the same item represented in different places. Here the only redundancy is in the DDL, it's not in the data, so "normalization" is not relevant here. (JPA has the concept of embedded classes that can address the redundancy).

TLDR: Use a separate table if the address is truly an Entity, with its own distinct identity and its own lifecycle. Otherwise don't.

Nathan Hughes
  • 94,330
  • 19
  • 181
  • 276
  • I don't see this as normalization, but there is some benefit in standardization - in the sense that the basic address table design is consistent for all "addresses" and that if a design change occurs, it only has to be made in one place. – Cade Roux Oct 03 '11 at 20:36
  • 1
    @Cade: yes, consistency is a good goal to have, but there could be other ways to reduce duplication, such as using a script to generate the DDL. – Nathan Hughes Oct 03 '11 at 20:48
  • indeed, if you have multiple addresses in the same entity perhaps in different roles or not, normalization would apply, however, and then it starts to make more sense to do it across entities as well. – Cade Roux Oct 03 '11 at 21:07
  • Another Great Post. Well Said Nathan Hughes. – Mosia Thabo Aug 06 '19 at 21:51
  • @Mosia: thank you, i appreciate the feedback. i did take another look at this and edited a bit. – Nathan Hughes Aug 08 '19 at 13:10
12

I think there is a problem you are not aware of and that is that some of this data is time sensitive. You do not want your records to show you shipped an order to 35 State St, Chicago Il, when you actually sent it to 10 King Street, Martinsburg WV but the customer moved two years after the order was shipped. So yes, build an address table to get the address at that moment in time as long as any change to the address for someone like a customer results in a new addressid not in changing the current address which would break the history on an order.

HLGEM
  • 94,695
  • 15
  • 113
  • 186
  • Most use cases will prefer the current address and, thus, benefit from normalization. But, as you say, there are other use cases that require the "as of" address. For these cases, it's usually required to take a snapshot of the current values in to the frozen table. This logic applies to many types of data, not just addresses. It could be products (which get discontinued) or prices (which change), etc. – Charlie Reitzel Aug 07 '19 at 23:18
3

What you have to answer for yourself is the question whether the same address in everyday language is actually the same address in your database. If somebody "changes his address" (colloquially), he really links himself to another address. The address per se only changes when a street is renamed, a zip-code reform takes place or a nuke hits. And those are rare events (hopefully for the most part). There goes your main profit: change in one place for multiple rows (of multiple tables).

If you should actually change an address for that in your model - in the sense of an UPDATE on table address - that may or may not work for other rows that link to it. Also, in my experience, even the exact same address has to look different for different purposes. Understand the semantic differences and you will arrive at the right model that represents your real world best.

I have a number of databases where I use a common table of streets (which uses a table of cities (which uses a table of countries, ...)). In combination with a street number think of it as geocodes (lat/lon), not "street names". Addresses are not shared among different tables (or rows). Changes to street names and zip codes cascade, other changes don't.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Would you mind explaining your last paragraph a bit more? I don't really understand any of it and I'm interested in how you handle addresses. What do you mean by a 'common' table of streets? Is street number stored in the same table? How do you link, e.g., Users to addresses? And, say, Orders? When you say addresses are not shared, do you just mean that you any row in your street table is 'owned' by precisely one other row (in another table)? In what way do changes to street names and zip codes cascade? And do you just mean in terms of id, or do you 'stamp' fields? – Vorpulus Lyphane May 19 '17 at 05:32
  • Vorpulus - Imagine having a table with address and seperate tables for cities and countries. It's a fact that a city or a country has longer life cycles - has a longer life cycle and everyone registering will select a known city, therefore you can create a table for them to always opt from. But Addresses(Street Number and Name) can change anytime. So you'll just reference city and Country by ID – Mosia Thabo Aug 06 '19 at 22:01
2

One should maintain some master tables for City, State and Country. This way one can avoid the different spellings for these entities which might end up with mapping same city with some different state/country.

One can simply map the CityId in the address table as foreign key as shown below, instead of having all the three fields separately (City, State and Country) as plain text in address table itself.

Address: {
    CityId
    // With other fields
}

City: {
   CityId
   StateId
  // Other fields
}

State: {
   StateId
   CountryId
 // Other fields
}

Country: {
  CountryId
  // Other fields
}

If one maintains all the three ids (CityId, StateId and CountryId) in address table, at the end you have to make joins against those tables. Hence my suggestion would be to have only CityId and then retrieve rest of the required information though joins with above table structure.

ΩmegaMan
  • 29,542
  • 12
  • 100
  • 122
2

You would normally normalise the data as far as possible, so use the table 'Addresses'.

You can use views to de-normalise the data afterwards which use indexes and should give a method to access data with easy references, whilst leaving the underlying structure normalised fully.

The number of joins shouldn't be a major issue, index based joins aren't too much of an overhead.

dougajmcdonald
  • 19,231
  • 12
  • 56
  • 89
2

It's fine to have a split out addresses table.

However, you have to avoid the temptation of allowing multiple rows to refer to the same address without an appropriate system for managing options for the user to decide whether and how changing an address splits out a row for the new address change, i.e. You have the same address for billing and ship-to. Then a user says their address is changing. To start with, old orders might (should?) need their ship-to addresses retained, so you can't change it in-place. But the user might also need to say this address I'm changing is only going to change the ship-to.

Cade Roux
  • 88,164
  • 40
  • 182
  • 265
  • Excellent point. The way I was thinking of handling this was to not allow multiple rows to reference the same ID and simply create a new entry in the Addresses table for every new address, even if it exists before. This would still allow me to find unique addresses via UNION or SELECT DISTINCT. Does that jive? – Vince Fedorchak Oct 03 '11 at 19:43
  • Now that I'm thinking about it, perhaps another solution would be to add a new column to the Addresses table and call it ChangedTo, which will eventually refer to the new record in the Address table if the original address was ever modified. This would enable a trail that would show every update to the address back to the original (where ChangedTo is NULL). – Vince Fedorchak Oct 03 '11 at 19:57
  • @VinceFedorchak I would not build auditing into this design but instead use a general auditing approach like AutoAudit (http://autoaudit.codeplex.com/). As far as unique addresses, whether that is relevant depends upon your reporting requirements, why you are looking for unique addresses etc. I view the Addresses as a sharing of a common design to simplify the database and standardize a component even though it is used in (possibly quite) different roles. – Cade Roux Oct 03 '11 at 20:33
0

I prefer to use an XREF table that contains a FK reference to the person/business table, a FK reference to the address table and, generally, a FK reference to a role table (HOME, OFFICE, etc) to delineate the actual type of address. I also include an ACTIVE flag to allow me to choose to ignore old address while preserving the ability to maintain an address history.

This approach allows me to maintain multiple addresses of varying types for each primary entity

Steve S
  • 11
  • 2