Basic SQL design question: One table with lots of repeated text, or two separate tables?

Question

A general database design question:

Suppose I'm putting together a fairly large database - say tens of thousands of rows, not many columns - and I want one of those columns to be an identifying 'tag' that would most obviously be a long-ish string of 30-50 characters.

And the number of distinct tags would be fairly limited.

Every introductory thing I've read about SQL and database design suggests that you should be careful about choosing the right data types for your info - don't store info as a float if you can get by storing it as an integer, don't store info as text if you can store it as a number - to keep your database from getting unnecessarily large. But I'm not sure how much difference that makes to performance.

In my particular case, it seems like it would be easy enough to create a second table with all of my English 'tags' paired with some arbitrary integer 'code'. Then, I could store only the integer codes in my main table, making it smaller. But, then, for a user to access the data they want from the main table - without knowing the integer codes and only knowing the English tags - every query would (I think) have to look something like:

"SELECT * FROM maintable WHERE code = (SELECT code FROM secondtable WHERE tag = 'English tag')"

That seems like it would work, but then every query would effectively be two queries.

So, is there likely to be a meaningful performance benefit from making the main table smaller, by replacing the English 'tags' with integer 'codes'? And, if so, is that performance improvement likely to outweigh the performance cost of essentially doubling the number of queries?

Is there a rule of thumb on this issue? A generally preferred way of proceeding? If so, I would love to know before getting too far in.

score 3 · Answer 1 · answered Jun 11 '20 at 00:15

In my opinion it is better in terms of maintenance and scalability that you have two separate tables with indexed values, so you will have a TAG table each with the indexed ID and your main table that refers only to those IDs. And as pointed in this question it is better to use JOIN to get both values, so you will doing something like this:

SELECT
    *
FROM maintable m
JOIN tagtable t ON t.id = m.tag //You can choose also LEFT JOIN
WHERE
    m.tag='english tag'

But at the end, it all depends of your specific needs.

score 0 · Accepted Answer · answered Jun 15 '20 at 18:13

Let's make a concrete example. The question is considering two approaches:

Take1: only one table, directly embedding (and duplicating) the tag strings, something along the lines of:

CREATE TABLE take1 (
    id   INTEGER NOT NULL PRIMARY KEY,
    tag  TEXT,
    foo1 INTEGER,
    foo2 INTEGER
);

Take2: two tables, where the main table refers to the tag strings via IDs, something along the lines of:

CREATE TABLE take2 (
    id     INTEGER NOT NULL PRIMARY KEY,
    tag_id INTEGER,
    foo1   INTEGER,
    foo2   INTEGER
);

CREATE TABLE tags (
    tag_id INTEGER NOT NULL PRIMARY KEY,
    tag    TEXT
);

First comment: take1 uses a table that is not normalized. For example, imagine that you decide to edit one of the tags (say it was misspelt). With take1, you have to edit all the rows that embed the given tag. With take2, you have to edit only 1 row.

I wrote a small program using SQLite to compare the performance of the two approaches, assuming that the main table has 80K rows and that the unique tags are 38.

From a file size point of view:

take1 generates a database file of 4.7MB in 200ms.
take2 generates a database file of 1.3MB in 170ms.

From a performance point of view:

For take1 the query

select count(*)
from take1
where tag = "consciousness as a paradox.";

takes 8ms

For take2 the query

select count(*)
from take2, tags
where take2.tag_id = tags.tag_id and
tags.tag = "consciousness as a paradox.";

takes 10ms

But, thinking about it for a moment, we can also write

select count(*)
from take2
where take2.tag_id = (
  select tag_id from tags where tag = "consciousness as a paradox."
);

and this one takes 5ms.

So, according to these micro-benchmarks, take2 is better not only from a design point of view, but also from a performance point of view (at least to perform the queries above).

Last point: I wrote a program because I also needed to fill the tables, but if one has already the tables available, to get timing information one can also directly use the sqlite shell with .timer on:

$ sqlite3 take2.db
SQLite version 3.28.0 2019-04-15 14:49:49
sqlite> .timer on
sqlite> select count(*)
  ...> from take2
  ...> where take2.tag_id = (
  ...>   select tag_id from tags where tag = "consciousness as a paradox."
  ...> );
2105
Run Time: real 0.005 user 0.004756 sys 0.000089

(The 2105 comes from 80K / 38).

This is a wonderfully thorough answer and I'm grateful, @marco.m. I had, in fact, already started putting my database together using this structure, based on EQrxJx's answer, but I'm glad to see some hard numbers demonstrating the benefits. Thank you both. — jmaloney13, Jun 17 '20 at 12:27
@jmaloney13 happy that it helped! Given your comment, I would suggest to consider accepting one of the two answers :-) — marco.m, Jun 17 '20 at 13:17

Basic SQL design question: One table with lots of repeated text, or two separate tables?

2 Answers2