2

Let's assume we want to store following data in the relational database: CountryName, CapitalCityName, CapitalCityPostCode. Lets assume that one city has only one post code. We can implement it in one table in a trivial way:

Countries
[PK]CountryId, CountryName, CapitalCityName, CapitalCityPostCode

Or we can arrange it in the more normalized way into 2 tables in 1:1 relation:

 Coutries  
 [PK]CountryId, CountryName, [FK]CapitalCityId

and

 CapitalCities
 [PK]CapitalCityId, CapitalCityName, CapitalCityPostCode, [FK]CountryId

How will this affect the performance? For example - if we need to list all the countries with its capitals names, will it be significantly faster in the first case? How many records/columns do I need to have to see the difference?

Maciej Wanat
  • 1,527
  • 1
  • 11
  • 23
  • 1
    Obviously you can see the 1st isn't 3rd normal form. Performance wise a properly normalized table will be on par to a flat table as you have in the first example even when dealing with 10 and hundreds of millions of records. Though the flat file will always be slightly faster but at trivial amounts if properly related. The issue with the first becomes scalability over time. You're giving up a slight gain in performance for an unstable foundation if growth is needed. – xQbert Jun 05 '19 at 13:36
  • @xQbert That does answer my question - can you add it as an answer instead of a comment, so other people can refer to that? – Maciej Wanat Jun 05 '19 at 13:45
  • 1
    The performance difference would depend on how you access the tables in practice. Obviously, the extra storage of the required keys adds deadweight, and having to perform the join imposes a performance penalty (and additional query complexity). The only possible gain would be if the columns in Capital city were very rarely read, whereas Countries was often read, and both tables were extremely large, and in that situation having a condensed Countries table could possibly have some performance advantages. It would probably be a corner case however, to be addressed only if it had already arisen. – Steve Jun 05 '19 at 13:53
  • 1
    Also consider: https://stackoverflow.com/questions/1125004/which-is-more-efficient-multiple-mysql-tables-or-one-large-table as a previous similar quesiton. – xQbert Jun 05 '19 at 13:55

3 Answers3

2

Obviously you can see the 1st isn't 3rd normal form. Performance wise a properly normalized table will be on par to a flat table as you have in the first example even when dealing with 10s and 100s of millions of records. Though the flat file will always be slightly faster but at trivial amounts if properly related. The issue with the first becomes scalability over time. You're giving up a slight gain in performance for an unstable foundation if growth is needed

It's a Marginal difference at best. The single table will always have a slight advantage; which would become more pronounced when you're dealing with hundreds of millions of records+. But there's ways around that by partitioning the table into relevant blocks so the engine can multi-thread the results gathering and eliminate lots of unneeded records based on join & filter criteria.

As within any other development there is no one single silver bullet. There are always exceptions to the rules; context matters for each question. However, the broad brush approach says, normalize unless you KNOW there will NEVER be growth. (never's a long time! but then maybe the system has a known shelf life and will never achieve such long term existence. )

xQbert
  • 34,733
  • 2
  • 41
  • 62
2

In the above example, with tables for Country and CapitalCities, normalization is not very helpful. The one to one relationship doesn't cause any duplicacy or complications in reading and updating data. It would be helpful in case of data that need tables like Country and City with Country having capital city as fk of City. Reading from two tables require a join which is definitely slower(not by a great margin) than reading from a single table. With several hundreds of record the difference seen would be of 10 to 15 ms.

G. Pillai
  • 237
  • 1
  • 7
0

In general, if there is a one-to-one correspondence, then there's no advantage to splitting the data into two tables. You are not "normalizing" the data if there is no redundancy. Simply specify that the CapitalCityName column is indexed if you need to search by city. You will still get the search performance and remove the overhead of having to join two tables for the query.

On the other hand, if there are many postal codes for each city (very common), then a separate table would be required to normalize the data and provide a list of postal codes for each city (or the city for a specific postal code). But that brings in a new issue: The postal codes may not be unique: the same code might be repeated in more than one country, or there is even situations where "twin cities" share a common postal code. But that's an issue for another discussion.

daShier
  • 2,056
  • 2
  • 8
  • 14
  • Not to mention some cities have [multiple post codes](https://www.google.com/search?q=how+many+postal+codes+does+chicago+have%3F&rlz=1C1GCEA_enUS786US786&oq=how+many+postal+codes+does+chicago+have%3F&aqs=chrome..69i57j33.4815j1j7&sourceid=chrome&ie=UTF-8); so that concept wouldn't be supported with this structure. the link was for Chicago; but DC has hundreds of zip codes; some per building! – xQbert Jun 05 '19 at 13:37
  • Well right, it's not a perfect example. I added note that we assume that one city has one postcode. But normalization aside - I'm just interested in performance differences between these two. – Maciej Wanat Jun 05 '19 at 13:39
  • Marginal difference at best. The single table will always have a slight advantage; which would become more pronounced when you're dealing with hundreds of millions of records. But there's ways around that by partitioning the table into relevant blocks so the engine can multi-thread the results gathering and eliminate lots of unneeded records based on join & filter critieria. As within any other develpment there is no one single silver bullet. There are always exceptions to the rules; context matters for each question. The broad brush approach says, normalize unless you KNOW no growth. – xQbert Jun 05 '19 at 13:41