1

We want to store a user table in the database. For each user, there are a number of attributes, such as age, sex, date of birth, etc.

We can of course model these commonly shared attributes as the columns in our user table, however, it would become inconvenient later when we want to add new attributes, as we have to modify the table to add extra columns, and this user table may be large. Especially, it would be worse if we have more than one type of users and each user type may have some unique attributes.

Alternatively, we can adopt a more flexible design, where we use a separate table with the following schema:

user_id | attribute_id | attribute_name | attribute_value

So that each user may occupy a few number of rows, each of which contains the data for a specific attribute. This way, we don't need to worry about adding new attributes to users (or user types). However, there is another problem: the data type of each attribute may be different from each other, e.g. some may be int, some may be float, and some others may be string. In response, we can have another table for mapping the data types to attributes, such as:

attribute_id | data_type

Or, we can simply store everything as strings, but somehow I think it would be a bad idea in terms of the amount of space required.

So, there exists a tradeoff here, between the flexibility of the schema and the complexity/fragmentation of the schema, and I wonder which one would be overall more advantageous than the other, or, if there is a third and better way of doing it.

Any comment/suggestion is welcome! Thanks.

ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
skyork
  • 7,113
  • 18
  • 63
  • 103
  • 1
    If you look for information in text books or on the net about database normalization you'll find lots of answers to this question. There is no right answer, as each solution is a trade-off, as you've already said. The questions that only you can answer are in regards to how often new attributes will be added. If the attributes are relatively static and mostly in use I'd go with your first approach. – Richard A Mar 04 '12 at 00:42
  • I have seen the "user_id|attribute_id" approach implemented on several large database projects. In each case, the developers put a lot of thought into it, and it worked great at first. But after a few months of being in production, performance started going down the hill. The table became too much of a hotspot. – datagod Mar 04 '12 at 01:43

3 Answers3

2

It sounds like you might be considering a Dynamic Database TM.

In my opinion this is the wrong way to go; mainly for the reasons you've enumerated.

The second option:

Your attribute table would have to store everything as a string, then as you say map it back. This unnecessarily complicates things and means that whenever you select from the database, or do a comparison, you have to convert each value back to it's original data-type using your attribute_data_types table. Or you might choose to go down another route of having one column of each data-type and somehow dynamically choosing which column to update or select from depending on the data-type you're using.

If you do go down this route can I suggest that you add the data-type to that table, enforced with a check constraint rather than having attribute_data_types. The number of data-types isn't going to change often and you stop the addition of invalid-types to your data set by using this type of constraint.

Here is a previous question on the problems you might encounter with a dynamic database.

The first option:

Generally, I would store data at the same level, and preferably in the same table, as the unique id associated with it. So, if you have an attribute that's specifically joined to a user I would take the hit and add the column to the users table. You don't specify how many attributes you may want a user to have, so this may become ridiculous in the long run but you won't need to worry about it for a long time.

If you're worried about adding extra columns you can do a horrible hack at the beginning and add 20 extra columns, new_attribute1, ..., new_attribute20 and then rename them as you use them.

The third option:

You mention you're considering implementing user types... there is some data that you will want to store no matter the user, date of birth, name etc aren't something that's going to go away.

Though I don't like it as much as storing everything in the user table, you can keep all of your static attributes (date of birth etc) in the user table and then create a second table user_attributes, unique on user_id to store less generic ones, which you'll only have to join to on your less generic queries. It'll keep the size of the users table down.


At the end of the day, and as Richard A has commented, there is no correct way to go; you must do what is right for you. If it's possible, do a few tests first and see what works best for your own situation.

On a side note, never store age as you've mentioned. It changes every year and you have to keep updating your table; just store data of birth and do the calculation when you fetch the data.

Community
  • 1
  • 1
Ben
  • 51,770
  • 36
  • 127
  • 149
1

I agree that there is not one answer to this question.

Horror Story Time:
I had to implement a design developed by a college professor based on what was called "Indicative Information". The concept was basically an 80/20 rule. When designing database tables if the data colum was going to be used/populated 80% of the time it went into the base table. Otherwise it went into the "IndiciativeInfo" table.

The "IndicativeInfo" table became this monsterous hodgepdge collection of all types of miscellaneous data. What a nightmare. You couldn't inner join this table so you had to do separate secondary lookups for each of the 20% datatypes that might have been in this table.

The design took nearly 6 months to flush out on paper. When it finally got put into development, the designer (Mr. College Professor) was let go from the project.

Moral to this story... Never make one catch-all table for miscellaneous data.

If you are going to have different user types, than you might try implementing a super/sub type design.

The concept is simple. One base table say Users holds all the data that is common amongst all users. This is the SuperType table. Each user group would have their own separate SubType table.

I used this desing for a web based real estate database system. I created a SuperType table called Properties and several SubType tables (Residential, Condo, Commercial, MultiTenant, Land). The SubType tables hold values that are common amonst each subtype. This let me query all the Properties and then drill down into the SubTypes.

0

I would like to suggest another point to think about the design: performance.

There is a couple ways for designing schema which accomplish your requirement. The issues are not only the pros and cons over programming, but also something over performance. You may split table but if an attribute is always needed by most of screens or queries, there will be a performance bottleneck in future. In the opposite way, if your put the attributes into one table but those are just needed in a few screens or queries, there will be a performance bottleneck in some of more prominent screens or queries.

Besides the thinking of programming, I used to think about the performance while there are a lot of data in database.

For instance, if I need to query with a condition coming from a attribute, representing as numeric data, I would like to use it as integral type in database. Such strategy would give the database a opportunity to build index without too much effort and to query with advantages of index.

Although it's hard to foresee what will happen in future, planing is better than do nothing.

Mike Lue
  • 839
  • 4
  • 8