
Say I have a table of users with an id column and a ranking column. Each user can rank the other users in some order. Let's say this list can be very long (max of some constant, say 10,000), but will often be much shorter. It will only be necessary to store and retrieve a user's whole list, that is, each list will not need to be e.g. searched in a query.

One idea is to store this as a list of ids in the form of a comma-separated string. The only downside I see is that a foreign key cannot connect each id in a list to the id column of the corresponding user, meaning that if the id of a user changes, it will not automatically change in the lists. However, a user's id will never change, so this is a matter of principle (not breaking 1NF?).
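A minimal sketch of this first idea, using Python's built-in sqlite3 module. The table layout and the helper names `encode_ranking` and `decode_ranking` are assumptions made up for illustration, not part of any real schema.

```python
import sqlite3

def encode_ranking(ids):
    """Join user ids into one comma-separated string, preserving order."""
    return ",".join(str(i) for i in ids)

def decode_ranking(text):
    """Split the stored string back into a list of ints ('' means no ranking)."""
    return [int(part) for part in text.split(",")] if text else []

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the whole ordered list lives in one TEXT column.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, ranking TEXT NOT NULL DEFAULT '')"
)
conn.executemany("INSERT INTO users (id) VALUES (?)", [(1,), (2,), (3,)])

# User 1 ranks user 3 first, then user 2.
conn.execute(
    "UPDATE users SET ranking = ? WHERE id = ?", (encode_ranking([3, 2]), 1)
)

stored = conn.execute("SELECT ranking FROM users WHERE id = 1").fetchone()[0]
ranking = decode_ranking(stored)
```

Storing and retrieving the whole list is a single row read or write here, but the database has no way to check that the ids inside the string refer to existing users.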

Another idea is to create a table where each entry has a from and a to user id column as well as a ranking column. The downside of this is that the table can potentially contain a very large number of records (e.g. billions) compared to the number of users. Also, when retrieving the list of a user, it needs to search through the many records. There is also repetition of data: the ranking is now stored explicitly rather than implicitly, so making one change in the list can mean that all records belonging to that list have to have their ranking column updated.
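A sketch of what this second idea could look like, again with Python's sqlite3. The table name `rankings` and the column names `from_user_id`, `to_user_id` and `position` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
# Hypothetical schema: one row per (ranking user, ranked user) pair.
conn.execute("""
    CREATE TABLE rankings (
        from_user_id INTEGER NOT NULL REFERENCES users(id),
        to_user_id   INTEGER NOT NULL REFERENCES users(id),
        position     INTEGER NOT NULL,
        PRIMARY KEY (from_user_id, position),
        UNIQUE (from_user_id, to_user_id)
    )
""")
conn.executemany("INSERT INTO users (id) VALUES (?)", [(1,), (2,), (3,)])
# User 1 ranks user 3 first, then user 2.
conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)", [(1, 3, 1), (1, 2, 2)])

def get_ranking(conn, user_id):
    """Return the ranked user ids for one user, in list order."""
    rows = conn.execute(
        "SELECT to_user_id FROM rankings WHERE from_user_id = ? ORDER BY position",
        (user_id,),
    )
    return [row[0] for row in rows]
```

With this layout the foreign keys reject a ranking that points at a user id that does not exist, which the comma-separated string cannot do.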

What is the better solution? Is there a different solution entirely?

Edit: I think most will say the second is best because it is a relational database. However, can you say how the negative aspects could be mitigated, or why they don't matter? What if the ordered lists could be even longer, e.g. millions of elements each, so the lists would be more like a blob of data and an equivalent table could contain trillions of entries?

When using a table the user will probably need to have the ranking copied over to an array in a front-end language to edit it and so either each change will have to be made both on front-end and back-end, or the old records must be deleted and a new record for each element of the new list inserted.
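The "delete the old records and insert the new list" option could be sketched as below, using Python's sqlite3 and a hypothetical `rankings` table. The function name `replace_ranking` and the column names are assumptions for illustration. Doing both steps in one transaction means a failure part-way through does not leave a half-written list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per entry of a user's ordered list.
conn.execute("""
    CREATE TABLE rankings (
        from_user_id INTEGER NOT NULL,
        to_user_id   INTEGER NOT NULL,
        position     INTEGER NOT NULL,
        PRIMARY KEY (from_user_id, position)
    )
""")

def replace_ranking(conn, user_id, ordered_ids):
    """Replace a user's whole list atomically: delete old rows, insert new ones."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM rankings WHERE from_user_id = ?", (user_id,))
        conn.executemany(
            "INSERT INTO rankings (from_user_id, to_user_id, position) "
            "VALUES (?, ?, ?)",
            [(user_id, to_id, pos) for pos, to_id in enumerate(ordered_ids, start=1)],
        )

replace_ranking(conn, 1, [3, 2, 5])  # initial list
replace_ranking(conn, 1, [5, 3])     # edited list sent back from the front end

rows = conn.execute(
    "SELECT to_user_id, position FROM rankings "
    "WHERE from_user_id = 1 ORDER BY position"
).fetchall()
```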

John

1 Answer


Storing a comma-separated list is usually not a good design for a relational database.

Storing a comma-separated list of values is one type of denormalization. Good relational database design encourages normalization.

Every optimization improves one type of query at the expense of other queries. In your case, if you only store or retrieve the whole list of ids, then it could be a good optimization. But if you ever want to add an id to the list, or search for a specific id, or be assured the ids are sorted correctly, or do many other types of operations, then those tasks are not optimized.
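To illustrate one of those unoptimized tasks, here is a rough sketch (Python's sqlite3, hypothetical `users` table with a comma-separated `ranking` column, made-up helper `who_ranked`) of searching for a specific id inside the strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical denormalized table: each ranking is one comma-separated string.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, ranking TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "3,2"), (2, "13,1"), (3, "1,2")],
)

def who_ranked(conn, target_id):
    """Return ids of users whose comma-separated list contains target_id."""
    # Wrap both sides in commas so that 3 does not also match 13 or 30.
    pattern = f"%,{target_id},%"
    rows = conn.execute(
        "SELECT id FROM users WHERE ',' || ranking || ',' LIKE ? ORDER BY id",
        (pattern,),
    )
    return [row[0] for row in rows]
```

A pattern match on a computed string like this cannot use an ordinary index on the column, so every row has to be examined. With one row per list entry, the same question is a plain equality test on an indexable column.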

There are actually many disadvantages to using comma-separated lists, not only the one about foreign keys you mention. I wrote an old answer about this here: "Is storing a delimited list in a database column really that bad?"

Using normalized design makes a database more flexible. That is, you can run many types of queries against the data, and none are especially disadvantaged.

So optimizations like denormalization require you to be sure that you know up front which queries are important for your project, and that you won't need any of the types of queries that are made more costly by the denormalized design, or that if you occasionally do need those queries, you don't need them to be efficient.

You expressed concern about making many rows if you store this in a normalized fashion, but most RDBMS products can handle billions of rows.

Searches should not scan a lot of rows if you create the right indexes. Which indexes are the right ones depends on which queries you need to optimize.
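As a rough illustration, the sketch below uses Python's sqlite3 to compare SQLite's query plan for "fetch one user's list in order" before and after adding an index. The table, column and index names are assumptions, and other RDBMS products report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with no keys or indexes yet, to show the unindexed plan.
conn.execute("""
    CREATE TABLE rankings (
        from_user_id INTEGER NOT NULL,
        to_user_id   INTEGER NOT NULL,
        position     INTEGER NOT NULL
    )
""")

QUERY = "SELECT to_user_id FROM rankings WHERE from_user_id = ? ORDER BY position"

def plan(conn):
    """Return the human-readable detail strings of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description

before = plan(conn)  # no index yet: a full table scan plus a sort step

# Index chosen for this one query: filter on from_user_id, then read by position.
conn.execute(
    "CREATE INDEX idx_rankings_from_pos ON rankings (from_user_id, position)"
)
after = plan(conn)   # now an index search limited to one user's rows

print(before)
print(after)
```

The index here is tailored to a single query. A different query, such as finding everyone who ranked a given user, would need its own index on `to_user_id`.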


You also asked: "What if the ordered lists could be even longer, e.g. millions of elements each, so the lists would be more like a blob of data and an equivalent table could contain trillions of entries?"

With respect, if you had to solve data management at that scale, then you wouldn't be asking how to solve it on Stack Overflow. You'd employ some senior software architecture experts to solve it.

They'd tell you basically the same thing: you have to be very specific about what types of queries you need to run against this data before they can choose an optimal architecture to support those specific queries, because at that scale you can't afford anything but an optimal approach.

If you don't need to solve the problem at the "trillions of elements" scale, then using the relational solution is adequate and offers flexibility, as I described above.

I see plenty of SO questions asking how Facebook manages data at their scale. The answer is almost always: "it doesn't matter what they do, because you will never have to do what they do at their scale."

Bill Karwin
  • This makes sense, I made an edit. Particularly is there a way to avoid updating all records with higher ranking numbers when deleting a record? – John Oct 12 '21 at 21:10