
Here is my table structure:

// posts
+----+-----------+---------------------+-------------+
| id |   title   |        body         |   keywords  |
+----+-----------+---------------------+-------------+
| 1  | title1    | Something here      | php,oop     |
| 2  | title2    | Something else      | html,css,js |
+----+-----------+---------------------+-------------+

// tags
+----+----------+
| id |   name   |
+----+----------+
| 1  | php      |
| 2  | oop      |
| 3  | html     |
| 4  | css      |
| 5  | js       |
+----+----------+

// pivot
+---------+--------+
| post_id | tag_id |
+---------+--------+
| 1       | 1      |
| 1       | 2      |
| 2       | 3      |
| 2       | 4      |
| 2       | 5      |
+---------+--------+

As you can see, I store keywords in two ways: as a string in a column named keywords, and relationally in separate tables.


Now I need to select all posts that have specific keywords (for example php and html tags). I can do that in two ways:

1: Using unnormalized design:

SELECT * FROM posts WHERE keywords REGEXP 'php|html';
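(One caveat I know of: REGEXP matches substrings, so 'php' would also match a hypothetical keyword like 'phpunit'. Assuming the keywords column is strictly comma-separated with no spaces, as above, a stricter variant of the string approach could use FIND_IN_SET:)

SELECT * FROM posts
WHERE FIND_IN_SET('php', keywords)
   OR FIND_IN_SET('html', keywords);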

2: Using normalized design:

SELECT     posts.id, posts.title, posts.body, posts.keywords 
FROM       posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE      tags.name IN ('html', 'php')
GROUP BY   posts.id;

See? The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.

What do you think? I mean what's your recommendation and why?

Martin AJ
  • An easy one. No. In MySQL, functions cannot use indexes. Also, that's really an abuse of the GROUP BY clause - although, frustratingly, it is faster than its correct counterpart DISTINCT – Strawberry Jun 10 '17 at 05:50
  • The first query might look faster, but in reality it is going to be much slower: with `REGEXP`, MySQL has to compare every value in your `keywords` column against the pattern. Any index on `keywords` will be ignored and the whole `posts` table will be scanned. As the `posts` table grows, the query gets slower. – Giorgos Betsos Jun 10 '17 at 05:50
  • Possible duplicate of [Is storing a delimited list in a database column really that bad?](https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad) – philipxy Jun 10 '17 at 10:51
  • You guess. So what? My recommendation is that you read about abstraction & optimization in programming, particularly SQL/DBMSs. PS What do you mean, "normalized design"? There's no normalized design here, just a 2nd query that uses an (explicit) join in an unnormalized design. – philipxy Jun 10 '17 at 10:56
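For reference, the DISTINCT variant Strawberry refers to would be (a sketch, using the same tables as above):

SELECT DISTINCT posts.id, posts.title, posts.body, posts.keywords
FROM       posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags  ON tags.id = pivot.tag_id
WHERE      tags.name IN ('html', 'php');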

3 Answers


For small tables, you can use either at your discretion.

If you expect the table to grow, you really need the second choice. The reason is that REGEXP can never use an index in MySQL, and indexes are the key to fast queries. A join will use an index if one is declared on the join column.
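A minimal sketch of such indexes, assuming the three tables from the question and that pivot has no keys yet (the index names are illustrative):

ALTER TABLE tags  ADD UNIQUE INDEX idx_tags_name (name);
ALTER TABLE pivot ADD PRIMARY KEY (post_id, tag_id);
ALTER TABLE pivot ADD INDEX idx_pivot_tag_post (tag_id, post_id);

With these in place, the join query can find matching tags and pivot rows by index lookup instead of scanning.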

Bahadur Singh Deol

> The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.

Your intuition is simply wrong. Databases are designed to do JOINs. They can take advantage of indexing and partitioning to speed queries. More advanced databases (than MySQL) use statistics on tables to choose optimal algorithms for executing the query.

Your first query always requires a full table scan of posts. Your second query can be optimized in various ways.
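You can see the difference with EXPLAIN, assuming indexes exist on tags.name and on the pivot columns (a sketch; the exact plan depends on your data):

-- 'type: ALL' on posts here means a full table scan
EXPLAIN SELECT * FROM posts WHERE keywords REGEXP 'php|html';

-- with suitable indexes, this plan shows ref/eq_ref index lookups instead
EXPLAIN SELECT posts.id, posts.title, posts.body, posts.keywords
FROM posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags  ON tags.id = pivot.tag_id
WHERE tags.name IN ('html', 'php');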

Further, maintaining the consistency of the data is much more difficult with the first approach. You would probably need triggers to handle updates and inserts on all the tables, and that slows things down.
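For example, keeping the keywords string in sync after a pivot insert might look like this (a sketch only; the trigger name is made up, DELIMITER is a mysql client command, and you would need similar triggers for deletes and tag renames):

DELIMITER //
CREATE TRIGGER pivot_after_insert
AFTER INSERT ON pivot
FOR EACH ROW
BEGIN
  -- rebuild the denormalized keywords string for the affected post
  UPDATE posts
  SET    keywords = (SELECT GROUP_CONCAT(t.name ORDER BY t.id)
                     FROM   pivot p
                     JOIN   tags t ON t.id = p.tag_id
                     WHERE  p.post_id = NEW.post_id)
  WHERE  id = NEW.post_id;
END//
DELIMITER ;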

There are some cases where denormalization like this is worth the effort -- think of summary counts or running totals of dollars or time. Putting tags into a delimited string is much less likely to be beneficial, because avoiding the joins saves little relative to the other costs of parsing the string in SQL.

Gordon Linoff

All of this looks fine when we talk about data at a small scale. It is fundamental OLTP theory to keep tables normalized: when you expect your table to scale and want the data to be non-redundant and consistent, normalization is the answer. Of course there are costs involved in the joins, but they are trivial next to these issues.

Let's talk about your scenario:

Pros:

  • all data is available by querying one table.


Cons:

  • a function wrapped around a column forces the query optimizer to scan the whole table, regardless of any index on that column. This is very important from a data-scaling point of view.
  • each keyword is repeated in multiple rows, leading to data redundancy.
  • keywords appearing multiple times lead to data inconsistencies: if you want to remove or update a keyword, the column has to be searched and replaced in every row, and any occurrence left behind causes data integrity issues (see the sketch after this list).

There are many more. Read up on data normalization in RDBMSs.
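To make the remove/update point concrete, compare renaming the js tag to javascript under each design (a sketch using the question's tables):

-- normalized: one row changes; every post stays consistent
UPDATE tags SET name = 'javascript' WHERE name = 'js';

-- denormalized: string surgery on every row; note that a naive REPLACE
-- would also corrupt any keyword that merely contains 'js'
UPDATE posts
SET    keywords = REPLACE(keywords, 'js', 'javascript')
WHERE  FIND_IN_SET('js', keywords);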

avisheks