1

I'm currently debating between two strategies to using a text column as a key.

The first one is to simply use the text column itself as a key, as such:

create table a(
    key_a text primary key,
)

create table b(
    key_b text primary key,
)

create table c(
    key_a text,
    key_b text,
    
    foreign key("key_a") references a("key_a"),
    foreign key("key_b") references b("key_b")
)

I'm concerned that this would result in every key being duplicated, once in a and b and another in c, since text isn't stored inline.

My second approach is to use an autoincrement id on the first two tables as a primary key, and use those ids on table c to refer to them, as such:

create table a(
    id_a integer,
    key_a text unique,
    primary key("id_a" autoincrement)
)

create table b(
    id_b integer,
    key_b text unique,
    primary key("id_a" autoincrement)
)

create table c(
    id_a integer,
    id_b integer,
    
    foreign key("id_a") references a("id_a"),
    foreign key("id_b") references b("id_b")
)

Am I right to be concerned about text duplication in the first case? Or does sqlite somehow intern these and just use an id for both, akin to what the second strategy does?

forpas
  • 160,666
  • 10
  • 38
  • 76
Filipe Rodrigues
  • 1,843
  • 2
  • 12
  • 21
  • Does this answer your question? [Using text as a primary key in SQLite table bad?](https://stackoverflow.com/questions/23157411/using-text-as-a-primary-key-in-sqlite-table-bad). What about optimization: you may open your database as plain text and see that the text FK is stored as is. – astentx Jun 27 '21 at 23:32
  • @astentx The answers there mostly focus on the performance of using a text key, I'm looking to see if the strings are duplicated, I understand there might be a slight performance hit, but storage is more important in my use case. – Filipe Rodrigues Jun 27 '21 at 23:57

3 Answers3

2

SQLite does not automatically compress text. So the answer to your question is "no".

Should you use text or an auto-incrementing id as the primary key? This can be a complex question. But happily, the answer is that it doesn't make much difference. That said, there are some considerations:

  • Integers are of fixed length. In general, fix length keys are slightly more efficient in B-tree indexes than variable length keys.
  • If the strings are short (like 1 or 2 or 3 characters), then they may be shorter -- or no longer -- than integers.
  • If you change the string (say, if it is originally misspelled), then using an "artificial" primary key makes this easy: just change the value in one table. Using the string itself as a key can result in lots of updates to lots of tables.
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Actually, Sqlite stores integers using a variable-length encoding. See https://www.sqlite.org/fileformat2.html#record_format – Shawn Jun 28 '21 at 01:36
1

Am I right to be concerned about text duplication in the first case? Or does sqlite somehow intern these and just use an id for both, akin to what the second strategy does?

Yes, you are right to be concerned. The text will be duplicated.

Also, even if you did not define an integer primary key in your 1st approach, there is one.

From Rowid Tables:

The PRIMARY KEY of a rowid table (if there is one) is usually not the true primary key for the table, in the sense that it is not the unique key used by the underlying B-tree storage engine. The exception to this rule is when the rowid table declares an INTEGER PRIMARY KEY. In the exception, the INTEGER PRIMARY KEY becomes an alias for the rowid.

The true primary key for a rowid table (the value that is used as the key to look up rows in the underlying B-tree storage engine) is the rowid.

In your 2nd approach actually you are not creating a new column in each of the tables a and b by defining an integer primary key.
What you are doing is aliasing the existing rowid column:

  • id_a becomes the alias of rowid of the table a
  • id_b becomes the alias of rowid of the table b.

So, defining these integer primary keys is not more expensive in terms of space in the parent tables.

Although with your 1st approach you can avoid explicit updates in the child tables when you update a value in the parent tables by defining the foreign keys with ON UPDATE CASCADE, your 2nd approach is what I would suggest.

An integer primary key with a value assigned to it by the system and you don't even have to know or worry about it is common practice.
All you have to do is use that primary key and its corresponding foreign keys in the queries that you create to access the parent tables when you want to fetch from them the text values.

forpas
  • 160,666
  • 10
  • 38
  • 76
0

For performance (also it is a good db practice) you should stick to numeric/int value for the Primary Key.

As for the second approach, I'm not getting the concept you are after. Could you elaborate more on this?

marcin2x4
  • 1,321
  • 2
  • 18
  • 44
  • 1
    The second approach is just a way to make sure that every text is just stored once, and then can be refereed to by it's id. So instead of `insert into a values ("KeyA1")` (same for table `b`) and then `insert into c values("KeyA1", "KeyB1")`, I'd instead do `insert into a ("key_a") values ("KeyA1")` (same for table `b`) and then do `insert into c values(1, 1)` (assuming I know the id is 1 for both afterwards), thus not repeating the string, in hopes it won't be duplicated within the database – Filipe Rodrigues Jun 28 '21 at 00:01