24

Suppose that you want to store "tags" on your object (say, a post). With release 9.4 you have 3 main choices:

  • tags as text[]
  • tags as jsonb
  • tags as text (and you store a JSON string as text)

In many cases, 3rd would be out of question since it wouldn't allow query conditional to 'tags' value. In my current development, I don't need such queries, tags are only there to be shown on posts list, not to filter posts.

So, choice is mostly between text[] and jsonb. Both can be queried.
What would you use? And why?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
comte
  • 3,092
  • 5
  • 25
  • 41
  • 8
    http://www.databasesoup.com/2015/01/tag-all-things-part-2.html –  Mar 08 '15 at 18:43
  • 1
    nice link. i have my answer : i'm suprised how fast JSONB is, but still ARRAY performs usually better. thx. – comte Mar 09 '15 at 16:57
  • 1
    Really nice link! The queries for the JSONB and List examples (part 3 in the article) look a lot simpler as well, compared to the normalized table versions. – Rotareti Apr 08 '18 at 18:55
  • Yeah, if a normalized schema isn't desired, seems like jsonb and text[] are apparently similar, with text[] winning out in a few scenarios in the tag-all-things link above- – chrismarx Apr 06 '22 at 14:16
  • Something else to consider is that the array type allows for more comparison options (&&, @>, etc), the json array only @> (restrictive AND contains); something I found out the hard way – svenema Mar 25 '23 at 11:50

2 Answers2

15

In most cases I would use a normalized schema with a table option_tag implementing the many-to-many relationship between the tables option and tag. Reference implementation here:

It may not be the fastest option in every respect, but it offers the full range of DB functionality, including referential integrity, constraints, the full range of data types, all index options and cheap updates.

For completeness, add to your list of options:

  • hstore (good option)
  • xml more verbose and more complex than either hstore or jsonb, so I would only use it when operating with XML.
  • "string of comma-separated values" (very simple, mostly bad option)
  • EAV (Entity-Attribute-Value) or "name-value pairs" (mostly bad option)
    Details under this related question on dba.SE:

If the list is just for display and rarely updated, I would consider a plain array, which is typically smaller and performs better for this than the rest.

Read the blog entry by Josh Berkus @a_horse linked to in his comment. But be aware that it focuses on selected read cases. Josh concedes:

I realize that I did not test comparative write speeds.

And that's where the normalized approach wins big, especially when you change single tags a lot under concurrent load.

jsonb is a good option if you are going to operate with JSON anyway, and can store and retrieve JSON "as is".

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    normalized schema are not my cups of tea : it adds two tables, needs join to perform requests, but above all, fragments data. i guess that's why arrays have been implemented : one object, one table. also my viewpoint is that most of 'normalized schema' advantages have to be implemented in the application, otherwise you have a problem with the application. why do you say "jsonb is only a good option if you are going to operate with JSON anyway" ? that's the case, but jsonb has one great advantage : flexibility. – comte Mar 09 '15 at 16:42
  • 1
    If anything picking custom datatypes such as JSONB does not offer flexibility. You will be limited by what database libraries/ORMs you can use as most do not support collection like datatypes. – Adam Gent Mar 09 '15 at 17:26
  • 2
    that's why ORMs gains are (in my mind) questionable. Not sure what you gain in simplicity worth what you loose in flexibility. They lag DB progress, they have a (small) cost to get in. But it's a personal view though. – comte Aug 22 '15 at 11:54
  • If tags are ordered, updates will perform terribly. – stickfigure Jun 09 '20 at 17:25
  • @comte "needs join to perform requests" - what's wrong with joins? – Tom Sep 13 '20 at 07:19
  • @Tom: joins are slow. Subqueries are faster for instance. And with numerous objets, it's quick to have multiples joins in one query, and then you get really slow. Here is a short article I just found to illustratre and put a scale on it : https://medium.com/squad-engineering/blazingly-fast-querying-on-huge-tables-by-avoiding-joins-5be0fca2f523 – comte Sep 14 '20 at 08:18
  • 2
    @comte: You can formulate involved joins as subqueries if you prefer. But, while joins have a cost, the claim that `joins are slow` is too unspecific to be true. It depends. – Erwin Brandstetter Sep 14 '20 at 12:26
  • @comte: The query planner is a complex beast, and your article doesn't help it much. "purchase.user_id" is not marked as a foreign key, and its type is different from "user.id". Also, "tested on postgresql 9.6.2"; worth retesting with something more recent. And if these things don't make a difference then you could just duplicate the WHERE condition to filter both tables. Would be surprised if it didn't work. I see your citation and raise you this https://www.brianlikespostgres.com/cost-of-a-join.html where we achieve "joining 50 tables with 1M rows each, in just 12ms". – Tom Sep 17 '20 at 15:15
1

I have used both a normalized schema and just a plain text field with CSV separated values instead of custom data types (instead of CSV you can use JSON or whatever other encoding like www-urlencoding or even XML attribute encoding). This is because many ORM's and database libraries are not very good at supporting custom datatypes (hstore, jsonb, array etc).

@ErwinBrandstetter missed a couple of other benefits of normalized one being the fact that it is much quicker to query for all possible previously used tags in a normalized schema than the array option. This is a very common scenario in many tag systems.

That being said I would recommend using Solr (or elasticsearch) for querying for tags as it deals with tag count and general tag prefix searching far better than what I could get Postgres to do if your willing to deal with the consistency aspects of synchronizing with a search engine. Thus the storage of the tags becomes less important.

Adam Gent
  • 47,843
  • 23
  • 153
  • 203
  • I certainly missed most of the pros and cons, since this is a huge topic and I just named a couple of the prominent features. You added some more, ORM support probably being important for many ... – Erwin Brandstetter Mar 09 '15 at 22:43