128

I store various user details in my MySQL database. Originally it was set up across several tables, with data linked by UserId and output via sometimes complicated queries to display and manipulate it as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.

  • Is this going to be a help or hindrance?
  • Speed considerations in calling, updating or searching/manipulating?

Here's an example of some of my table structures (a simplified SQL sketch follows the list):

  • users - UserId, username, email, encrypted password, registration date, ip
  • user_details - cookie data, name, address, contact details, affiliation, demographic data
  • user_activity - contributions, last online, last viewing
  • user_settings - profile display settings
  • user_interests - advertising targetable variables
  • user_levels - access rights
  • user_stats - hits, tallies
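As a rough sketch only (column types are my own simplification, not the real schema), the current layout looks something like this:

```sql
-- Simplified sketch of the current split schema (types are illustrative)
CREATE TABLE users (
    UserId        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username      VARCHAR(50)  NOT NULL,
    email         VARCHAR(255) NOT NULL,
    password_hash CHAR(60)     NOT NULL,
    registered_at DATETIME     NOT NULL,
    ip            VARCHAR(45)
);

CREATE TABLE user_settings (
    UserId          INT UNSIGNED PRIMARY KEY,   -- 1:1 with users
    profile_display VARCHAR(20),
    FOREIGN KEY (UserId) REFERENCES users (UserId)
);

-- A typical read currently joins several of these tables
SELECT u.username, s.profile_display
FROM users u
LEFT JOIN user_settings s ON s.UserId = u.UserId
WHERE u.UserId = 42;
```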

Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.

Most of the tables have a 1:1 relationship, which was the main reason for considering denormalising them.

Are there going to be issues if the table spans 100+ columns when a large portion of these cells are likely to remain empty?

Peter Craig
  • This [other question](http://stackoverflow.com/questions/8685621/what-is-the-best-database-schema-to-support-values-that-are-only-appropriate-to/9460541#9460541) might be helpful too – Mosty Mostacho Oct 13 '13 at 05:37

9 Answers

77

Multiple tables help in the following ways / cases:

(a) If different people are going to be developing applications involving different tables, it makes sense to split them.

(b) If you want to give different kinds of authority to different people for different parts of the data, it may be more convenient to split them. (Of course, you can look at defining views and granting authorization on them appropriately.)

(c) For moving data to different places, especially during development, it may make sense to use separate tables, resulting in smaller file sizes.

(d) A smaller footprint may give comfort while you develop applications against a specific part of a single entity's data.

(e) What you thought of as single-value data may turn out to be multiple values in future. For example, credit limit is a single-value field as of now, but tomorrow you may decide to store it as (date from, date to, credit value). Split tables come in handy then.
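As an illustration of (e), a minimal sketch (the customer/credit tables here are hypothetical, not from the question):

```sql
-- Today: credit limit is a single-value column on the main table
ALTER TABLE customers ADD COLUMN credit_limit DECIMAL(12,2);

-- Tomorrow: the value becomes time-dependent, so it moves to its own table
CREATE TABLE customer_credit_limits (
    customer_id  INT           NOT NULL,
    date_from    DATE          NOT NULL,
    date_to      DATE,
    credit_value DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (customer_id, date_from)
);
```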

My vote would be for multiple tables - with data appropriately split.

Good luck.

user115905
    @RohitKhatri : To the best of my knowledge, having multiple tables will increase performance in most cases. – Hari Harker Sep 20 '16 at 08:56
  • 1
    @HariHarker Thanks for your answer, but I figured out that it depends on your access pattern. – Rohit Khatri Sep 20 '16 at 09:31
  • Until recently I was always storing all data in one table, but come to think of it, splitting data has a lot of advantages in terms of performance (depending on the use case of course), semantics (some data is better grouped in a different table) and development. For instance, I'm developing a custom ERP system right now on top of a legacy system. I had to expand the old database tables with extra columns, and decided to make new tables for the new data. Some new features come in handy for the legacy system, and now I can easily integrate them without having to rewrite too much of the old queries – Ogier Schelvis Nov 15 '16 at 11:15
41

Combining the tables is called denormalizing.

It may (or may not) make some queries (those which do lots of JOINs) run faster, at the expense of creating a maintenance hell.

MySQL is capable of using only one JOIN method, namely NESTED LOOPS.

This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.

Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.

Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.

If you have lots of records in the other tables, the increase in table scan time can outweigh the benefit of the records being scanned sequentially.

Maintenance hell, on the other hand, is guaranteed.
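For illustration only (the table and column names are taken from the question, and `users_flat` is a hypothetical merged table), the two query shapes being compared look roughly like this:

```sql
-- Normalized: one nested-loop lookup per joined table for each driving row
SELECT u.username, d.name, a.last_online
FROM users u
JOIN user_details  d ON d.UserId = u.UserId
JOIN user_activity a ON a.UserId = u.UserId
WHERE u.UserId = 42;

-- Denormalized: no lookups, but every full scan now walks a much wider table
SELECT username, name, last_online
FROM users_flat
WHERE UserId = 42;
```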

Quassnoi
  • 3
    If you have 10000 users and you are doing a join with a database set up with foreign keys correctly, then you should only need the intense lookup when doing something like select * from users where name="bob". Once you have bob, you are using an index to find the joined tables for bob, which is significantly faster because you are using bob's id. This happens regardless of whether you are doing a join in your query or querying bob and then querying a table separately. Of course, hopefully your second query is based on bob's id and not something else. – Rudy Garcia Sep 12 '16 at 16:37
18

Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the user's interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.

Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.

All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, is accessed very often but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate but saving them in a session variable of their own, so that you don't have to query the database for them very often.
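As a quick sketch of the first point: if interests are stored one per row, like below (a hypothetical layout, not necessarily the asker's), there is no sensible way to fold them into a single user row:

```sql
-- Hypothetical 1:many layout: several interest rows per user
CREATE TABLE user_interests (
    UserId   INT UNSIGNED NOT NULL,
    interest VARCHAR(50)  NOT NULL,
    PRIMARY KEY (UserId, interest)
);

INSERT INTO user_interests VALUES (42, 'cycling'), (42, 'jazz'), (42, 'mysql');
-- Merging these into one users row would mean either repeating the user data
-- three times or inventing interest1/interest2/interest3 columns.
```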

David Hedlund
  • 1
    I have to completely disagree with your comment that normalization only focuses on neatness and completely disregards performance. There is a trade off in both scenarios and denormalization actually puts data integrity at risk. I would say normalization of your database actually improves the overall performance of the database rather than having a quick negligible performance increase from a denormalized table. – Rudy Garcia Sep 12 '16 at 16:32
  • Given that the discussion is specifically about 1:1 relationships, splitting the tables isn't a *normalization* task, right? If there is no duplicated information, its normal even when its a single table. (Well, it might not satisfy `3NF` normalization, so benefit from a second table to resolve that, but that doesn't seem to be what OP is referring to re the other tables.) – ToolmakerSteve Apr 18 '19 at 13:28
17

Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.

Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.

ETA:

If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).

If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
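A hedged sketch of that 80/20 split (which columns count as "frequently used" here is purely illustrative):

```sql
-- Frequently used columns stay on the main table
CREATE TABLE users (
    UserId       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username     VARCHAR(50)  NOT NULL,
    email        VARCHAR(255) NOT NULL,
    last_online  DATETIME,
    access_level TINYINT
);

-- Rarely used columns live in a 1:1 side table, joined only when needed
CREATE TABLE user_extra (
    UserId       INT UNSIGNED PRIMARY KEY,
    address      VARCHAR(255),
    demographics TEXT,
    FOREIGN KEY (UserId) REFERENCES users (UserId)
);

SELECT u.username, x.address
FROM users u
LEFT JOIN user_extra x ON x.UserId = u.UserId
WHERE u.UserId = 42;
```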

Eric Petroelje
  • Yes each table only has 1 row for each user, simply to save the headache of managing a lot of duplicated data. This is why I'm thinking one table suits. If user data spanned multiple rows, I would expect to have those tables separated from the main user table. – Peter Craig Jul 14 '09 at 12:28
  • 1
    If every table has a 1 to 1 relation then one table would be easier to use. There is no need to split the table in that case. Splitting the table suggests that there is more than one row, which could lead to a case where another developer would treat them that way. – Richard L Jul 14 '09 at 12:34
  • Very interesting thought applying 80/20 to database table design. Got me thinking also on OOP (I'm primarily a Java developer) class design and wondering if the same might be effective there (put primary 80% application functionality in one class and the rest in other classes). – Zack Macomber Jul 02 '18 at 13:46
  • 1
    @ZackMacomber - No, class-splitting should be based on *locality of reference*. The benefit to splitting into multiple classes, is to draw a border around a smaller unit of functionality, so that it is easier to comprehend/test/change, and clear where that unit interacts with other units of functionality. The goal is to keep most connections (references, calls) *inside* one unit, with few connections *between units*. Defining several *interfaces* that the class implements, with different interface per use case, can be a useful first step towards that splitting. – ToolmakerSteve Apr 18 '19 at 13:32
  • @ToolmakerSteve Good thoughts +1 – Zack Macomber Apr 18 '19 at 13:45
9

Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have three rows with the same user data in them, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.

Edit: I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...

a large portion of these cells are likely to remain empty

If, for example, a user didn't have any interests, then with a normalized design you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.

I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins, calculations etc. that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
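A rough sketch of that kind of snapshot job, translated to MySQL for this question (the real system used SQL Server, and the table/column names here are illustrative):

```sql
-- Rebuild a denormalized reporting snapshot; schedule this e.g. weekly
DROP TABLE IF EXISTS user_report_flat;

CREATE TABLE user_report_flat AS
SELECT u.UserId,
       u.username,
       a.contributions,
       a.last_online,
       s.hits
FROM users u
LEFT JOIN user_activity a ON a.UserId = u.UserId
LEFT JOIN user_stats    s ON s.UserId = u.UserId;
```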

  • I like this approach, because the denormalized data only exists temporarily, as a snapshot of a moment in time. No insert/modify/delete issues - just throw it away when done. – ToolmakerSteve Apr 18 '19 at 13:44
7

Why not use the same approach WordPress does: have a users table with the basic user information that everyone has, then add a "user_meta" table that can hold basically any key/value pair associated with the user id. If you need all the meta information for a user, you can just add that to your query; you also don't have to run the extra query when it isn't needed, for things like logging in. This approach also leaves your table open to adding new features to your users, such as storing their twitter handle or each individual interest. And you won't have to deal with a maze of associated IDs, because you have one table that rules all the metadata, limited to only one association instead of 50.

WordPress specifically does this to allow features to be added via plugins, which makes your project more scalable and means it won't require a complete database overhaul if you need to add a new feature.
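As a sketch of that pattern (simplified, not the exact WordPress `wp_usermeta` definition):

```sql
CREATE TABLE user_meta (
    meta_id    BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,
    meta_key   VARCHAR(255) NOT NULL,
    meta_value TEXT,
    KEY idx_user_key (user_id, meta_key)
);

-- Any new per-user attribute is just another row; no schema change needed
INSERT INTO user_meta (user_id, meta_key, meta_value)
VALUES (42, 'twitter_handle', '@example');

SELECT meta_key, meta_value
FROM user_meta
WHERE user_id = 42;
```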

Rudy Garcia
  • The Wordpress `wp_usermeta` table grows geometrically. Each user adds X rows to the `wp_usermeta` table, one row for each piece of meta information we want to keep for that user. If you keep 8 custom fields for each user, that means wp_usermeta will be `users * 8` rows long. This seems to be causing performance issues, but I'm not sure if that's the issue or not… – thirdender Apr 03 '14 at 15:27
  • 1
    I could see how this could cause performance issues if you have tens of thousands of users. Basically the database would have to search through 10000 * 8 entries in the user meta table to find the ones you're looking for. However, if you only query the meta data when needed, I would think your performance would be better. If you are always asking for the meta data even when you don't need it then you might have issues. If you always need the meta data then maybe splitting the tables isn't the best approach. – Rudy Garcia Apr 04 '14 at 15:49
  • 1
    Just yesterday we dealt with a WP theme that was loading all users (using `get_users()`) just to calculate the pagination. Once we corrected the code to use a `SELECT COUNT(…)` query for the pagination instead, the page load time went from 28 seconds to about 400ms. I still wonder how the performance compares to joined tables or a single flat table… I've had trouble finding any performance metrics on the web. – thirdender Apr 05 '14 at 07:54
  • Thinking about my previous comment it would seem that splitting the table is still efficient unless for some reason, such as the above example of pagination, you would need to select all users. Although if you are retrieving all of the meta information you would still have 80k entries in the usermeta table. That is a lot to search through. Perhaps someone could test what is a better approach by running a script on both implementations and run it 100 times to get the average, I might just do that. – Rudy Garcia May 28 '14 at 17:53
  • 1
    I read through this again just today and realized that my comment about 10000 * 8 entries is true, however the way a database works should make it mostly a non issue. If for some reason you were grabbing all 10000 users AND then also their meta information this would be ridiculous. I can't think of any scenario where you would want this. A database will easily retrieve the meta for a single user with lightning speed though because of foreign keys and indexing. Assuming your db model is set up correctly. – Rudy Garcia Sep 12 '16 at 16:00
  • @RudyGarcia : For a system I am developing right now, very soon there will be close to 10000 users and I am in need of storing **10+** tables that will have a `1:1` relationship with the `user` table. The system is all about storing and retrieving the complete profile of the user and nothing else. So I'm assuming this `user_meta` table with just a `link_id` concept should work. But there is one requirement where the admin might want to make a booklet of all the profiles. Do you think, at that time, this design might be screwed? – Hari Harker Sep 20 '16 at 09:05
  • Do you mean by booklet they will want to query every user and meta data together? I am not sure why this would be necessary. No human could likely read a list of 10k users. If you mean they want to print a directory like a phone book then I don't think that waiting a few minutes is unreasonable as it will take the printer 100 times that long to print that info. You may run into memory issues if you are trying to do this though both on machine and printer. Your best bet is to test what you need. No matter how you spin it getting all 10k + meta will be ridiculous even without splitting tables. – Rudy Garcia Sep 20 '16 at 13:22
  • My gut instinct is that this `meta` approach is *only* worthwhile if a dynamic set of keys is desired. For a fixed set of keys, when it's known that a single row matches, a `LIMIT 1` query is preferable, even if most of the fields are empty (assuming empty means `null` or a zero-length varchar). The userid index goes straight to the needed row - no table scan needed. That more than makes up for having each row be longer to hold those empty fields. – ToolmakerSteve Apr 18 '19 at 13:49
5

I think this is one of those "it depends" situations. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.

Tundey
1

I would say it depends on what the other tables really mean. Does user_details contain more than one row per user, and so on? What level of normalization is best suited for your needs depends on your demands.

If you have one table with a good index, that would probably be faster, but on the other hand probably more difficult to maintain.

To me it looks like you could skip User_Details, as it is probably a 1:1 relation with Users. But the rest probably hold a lot of rows per user?

Richard L
0

Performance considerations on big tables

"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
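A minimal sketch of that split (table and column names are hypothetical):

```sql
-- Hot counters live in their own small table, so the frequent increment
-- touches a tiny row instead of the wide, busy users table
CREATE TABLE user_counters (
    UserId INT UNSIGNED PRIMARY KEY,
    views  BIGINT UNSIGNED NOT NULL DEFAULT 0,
    likes  BIGINT UNSIGNED NOT NULL DEFAULT 0
);

UPDATE user_counters SET views = views + 1 WHERE UserId = 42;
```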

Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
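A sketch of the "1:rarely" pattern (names invented for illustration):

```sql
-- Rarely populated columns get their own table; a row exists only when needed
CREATE TABLE user_rare_details (
    UserId         INT UNSIGNED PRIMARY KEY,
    loyalty_points INT,
    fax_number     VARCHAR(30)
);

-- LEFT JOIN only when these columns are required; COALESCE() turns the
-- NULLs from missing rows into a usable default
SELECT u.username,
       COALESCE(r.loyalty_points, 0) AS loyalty_points
FROM users u
LEFT JOIN user_rare_details r ON r.UserId = u.UserId
WHERE u.UserId = 42;
```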

Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table but then has to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
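For example (column names are made up), no single composite index can cover a filter that spans two tables:

```sql
-- The optimizer can use an index for one side; the other condition is
-- checked row by row after the nested-loop join
SELECT u.UserId
FROM users u
JOIN user_activity a ON a.UserId = u.UserId
WHERE u.access_level = 3
  AND a.last_online > '2019-01-01';
```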

TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.
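A one-line illustration (assuming a bulky TEXT column, such as a hypothetical `bio`, lives on the table):

```sql
SELECT UserId, username FROM users WHERE UserId = 42;  -- off-record bio is not fetched
SELECT * FROM users WHERE UserId = 42;                 -- may incur extra off-page reads
```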

Rick James