
Seeking general thoughts/advice; links to books/resources are greatly appreciated!

I am building a forum where each topic has posts, replies to posts, replies to the replies and so on for many levels... much like a commenting section. At the moment, all posts are in the same table, but I wonder if, as the number of topics increases, I should be concerned about issues stemming from the size of the posts table. The main concern would be a decrease in the speed with which posts on a topic are retrieved. My other concern is that errors or malicious hacking could lead to, for example, the wrong posts being displayed on a topic (not all users are supposed to have access to all topics).

The alternative solution would be to create a new table of posts whenever a new topic is created. Perfectly doable, but it would have the disadvantage of creating an ever-expanding database, which I presume would be harder to maintain and handle. Am I right about the latter?

In short: is there any point (e.g. an expected number of topics/posts) at which the size of a table becomes an issue? Mainly out of curiosity, but how do giant commenting/social media platforms handle posts?

Posts like What's the ideal database table structure for forum posts and replies? or Table design for a forum don't answer my question. Thanks in advance for any insights!

Shadow
dan
  • Please read [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) to see how you can improve your question. – Andy Preston Mar 04 '23 at 16:00
  • see: [Limits on table size](https://dev.mysql.com/doc/refman/8.0/en/table-size-limit.html). Do you expect your table size to grow over 1TB ? – Luuk Mar 04 '23 at 16:12
  • In short: **no**, the rowcount of a table does not (or rather: _should not_) matter: one of the first things you learn in CS is that retrieval of data from a sorted data-source will never take longer than `O( log n )` time (where `n` is the size of the data-source) - in practice this means retrieving a specific post from a table with 100 billion rows will only take infinitesimally more time than from a table with 1,000 rows. – Dai Mar 04 '23 at 16:13
  • Thanks! What about security concerns? – dan Mar 04 '23 at 16:21
  • @dan The size of a table, or more broadly: _however_ the data stored (be it CSV flat-files, remote Redis or `memcached`, RDBMS tables or whatever) _is completely immaterial_ to the security of a system accessed through a web-application. If your system "gets hacked" through a vulnerability in your website code such that an attacker can make arbitrary changes to your data then it's game-over - whether your table has 10 rows or 1,000,000 rows makes no difference and I don't understand why you think it is... – Dai Mar 04 '23 at 16:41
  • ...unless you're thinking that `FOREIGN KEY` constraints are some kind of security measure? ...well, _sort-of_: they are an _absolutely essential_ part of ensuring data-isolation in a multi-tenant scenario, but so is [row-level data-security](https://learn.microsoft.com/en-us/sql/relational-databases/security/row-level-security) - _and dozens of other DB features, usage techniques and practices_ - but FKs aren't a _security feature_ per se (unlike, again, row-level security). And in any event, foreign keys have nothing to do with the rowcount of a table. – Dai Mar 04 '23 at 16:45
  • Thanks @Dai. The thinking was simply that, if I create a separate table for each topic (each table would have an assigned code) then it can't be as easy for a bug or a hack to access data, since even if, say, the topic_id gets messed with, the corresponding posts aren't there to be retrieved directly. But of course there would be a table linking posts and table codes, so someone accessing the DB would be able to find everything - but they couldn't do so right from hacking the session of a user (which only has their user_id and topic_id session variables). But I guess it is a moot point? – dan Mar 04 '23 at 16:52
  • @dan _"if I create a separate table for each topic"_ <-- **do not** do that... _just don't_: _that's just not how RDBMS databases work_. Please, please please read a textbook on database-theory that explains the fundamentals because you seem to be misunderstanding a lot right now. – Dai Mar 04 '23 at 16:55
  • Good book on relational theory: [SQL and Relational Theory: How to Write Accurate SQL Code, 3d edition](https://www.amazon.com/SQL-Relational-Theory-Write-Accurate/dp/1491941170) – Bill Karwin Mar 04 '23 at 17:03
  • @Dai, I hope it was clear that I meant "a table of posts" for each topic, as explained in the question. I have already built the database and the forum in the usual way, it works fine, and I was simply considering an alternative, if unorthodox structure. Anyhow, thanks for your time! – dan Mar 04 '23 at 17:15
  • @dan Other people/companies (with deep pockets) have already provided similar services. Years have been spent - do you have years available and financial resources? – Wilson Hauck Mar 06 '23 at 14:55

1 Answer


Indexing is the most effective way to improve the performance of queries. The trick is to let a query find the rows it needs without examining many rows that are not what it needs. You might like my presentation How to Design Indexes, Really, or the video.
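As a minimal sketch of that principle (table and column names are illustrative, not from the question; SQLite is used here so the example runs anywhere, but the same idea applies to MySQL):

```python
import sqlite3

# Hypothetical posts table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        topic_id INTEGER NOT NULL,
        parent_post_id INTEGER,       -- NULL for top-level posts
        created_at TEXT NOT NULL,
        body TEXT NOT NULL
    )
""")

# A composite index matching the query's WHERE and ORDER BY lets the
# database jump straight to one topic's posts instead of scanning the table.
conn.execute("CREATE INDEX idx_posts_topic_created ON posts (topic_id, created_at)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, body FROM posts WHERE topic_id = ? ORDER BY created_at",
    (42,),
).fetchall()
print(plan)  # the plan names idx_posts_topic_created rather than a full scan
```

In MySQL you would check the same thing with `EXPLAIN` and confirm the query uses the index rather than a full table scan.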

Data types can affect optimization too. Use the smallest data type that supports the values you need to store in a column. Use numerics and dates instead of making everything a string. Use a consistent character set and collation for all string columns.
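A hedged sketch of those rules (column names are my own illustration; SQLite is used so the DDL actually parses, but the type choices are written MySQL-style):

```python
import sqlite3

ddl = """
CREATE TABLE posts (
    id          INTEGER PRIMARY KEY,
    topic_id    INTEGER NOT NULL,           -- numeric FK, not a string code
    author_id   INTEGER NOT NULL,
    created_at  DATETIME NOT NULL,          -- a real date type, not a VARCHAR
    is_deleted  TINYINT NOT NULL DEFAULT 0, -- one byte beats a 'yes'/'no' string
    body        TEXT NOT NULL
)
"""
# In MySQL you would also declare one consistent CHARACTER SET and COLLATION
# for all string columns (e.g. utf8mb4), which SQLite does not model.
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
print("table created")
```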

Archive obsolete data. I can't tell you how many times I've seen a huge table where, if pressed, the developers admit that 99% of the data is garbage, but they can't bring themselves to remove it. If you have a table that grows continually, you must implement a strategy to "prune" it periodically. If the data is important for historical purposes, that's fine; you don't have to destroy it. You could preserve it in some larger, cheaper storage.
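A sketch of such a prune job (table names and cutoff are made up for illustration): move old rows into an archive table and delete them from the hot table in one transaction, so the job either fully succeeds or leaves everything in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at TEXT, body TEXT);
    CREATE TABLE posts_archive (id INTEGER PRIMARY KEY, created_at TEXT, body TEXT);
    INSERT INTO posts VALUES (1, '2015-01-01', 'old'), (2, '2023-03-01', 'recent');
""")

cutoff = "2020-01-01"
with conn:  # one transaction: archive and delete together, or not at all
    conn.execute(
        "INSERT INTO posts_archive SELECT * FROM posts WHERE created_at < ?",
        (cutoff,),
    )
    conn.execute("DELETE FROM posts WHERE created_at < ?", (cutoff,))

print(conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0])          # 1
print(conn.execute("SELECT COUNT(*) FROM posts_archive").fetchone()[0])  # 1
```

In production you would run this on a schedule and delete in batches to avoid long lock times.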

MySQL Server tuning options. This is a large topic and it can take years to learn how to use MySQL options to optimize your workload. There are good books like High Performance MySQL. It'd be easier to just hire a good MySQL consultant to review your server tuning. Be prepared to do this about once a year as you grow.
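To give a flavor of what tuning touches, here is a fragment of a `my.cnf`; the values are placeholders, not recommendations, and the right numbers depend entirely on your hardware and workload:

```ini
[mysqld]
# The single most impactful InnoDB setting: how much RAM caches data pages.
innodb_buffer_pool_size = 8G
# Redo log size affects write throughput and crash-recovery time.
innodb_log_file_size = 1G
# Cap concurrent connections at what the server can actually serve.
max_connections = 500
```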

Caching means keeping frequently requested data in RAM for quicker access. The most performant SQL query is the one you don't have to run at all, because the data you need is already in memory.
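A minimal cache-aside sketch (an in-memory dict stands in for a real cache; a deployment would use Redis or memcached with expiry):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO topics VALUES (1, 'Databases')")

cache = {}

def get_topic_title(topic_id):
    if topic_id in cache:      # hit: no query runs at all
        return cache[topic_id]
    row = conn.execute(
        "SELECT title FROM topics WHERE id = ?", (topic_id,)
    ).fetchone()
    title = row[0] if row else None
    cache[topic_id] = title    # miss: fill the cache for next time
    return title

print(get_topic_title(1))  # first call queries the database
print(get_topic_title(1))  # second call is served from the cache
```

The hard part in practice is invalidation: when a topic changes, its cache entry must be evicted or overwritten.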

As for the maximum table size, see my answer at https://stackoverflow.com/a/2716470/20860

For what it's worth, at my last job we configured an alert if any single table was over 512GB, or if the total schema was over 1024GB. These were more or less chosen as arbitrary round numbers (at least round numbers to a computer architect). There's no magic failure that occurs at that size, it's just an indicator because we figured if a table grows to half a terabyte, then it will continue to grow in an unbounded way, and it needs to be managed.

I would not recommend your idea of creating a new table per topic. This leads to runaway growth of the number of tables, and the tables share resources like memory and storage anyway. If a single table is so large that you have performance problems even after using indexes, then it probably won't help to split it into multiple tables.

Another problem with a table per topic is that they won't be uniform in size. You could still have a single topic-specific table that is larger than you want.

How do giant social network companies deal with this? They scale out to multiple database servers. This introduces a lot of complexity to your application, but it does allow virtually unlimited growth, because you add more memory, storage, and CPU resources as you add servers. If you're just starting out, you're probably not going to have a need to do this anytime soon.
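A toy sketch of what "scaling out" means for routing a query: a stable function of the topic id picks one of N database servers. The server names are made up, and real systems use consistent hashing or a lookup service so shards can be added without remapping everything.

```python
# Hypothetical shard list; in reality this comes from service discovery.
SHARDS = ["db0.example.net", "db1.example.net",
          "db2.example.net", "db3.example.net"]

def shard_for_topic(topic_id: int) -> str:
    # Simple modulo placement: all posts for one topic live on one server,
    # so a topic page never needs a cross-shard query.
    return SHARDS[topic_id % len(SHARDS)]

print(shard_for_topic(42))  # db2.example.net
print(shard_for_topic(7))   # db3.example.net
```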

If you do plan to grow this site indefinitely, you should be prepared to change optimization strategies from time to time. When I was a MySQL performance consultant, we had a saying that for every 10x growth in scale, you should reevaluate your current architecture and consider changing it. The solutions at different scales may be different.

You also said you were concerned about errors or hacking. This is another topic, unrelated to performance. I won't go into this in detail, except to give the general advice to use parameterized queries and avoid formatting dynamic queries with string concatenation or variable expansion. Study the guides at https://owasp.org/ for more information.
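A minimal demonstration of why parameterized queries matter (table and data are invented for the example; SQLite is used, but MySQL drivers behave the same way with placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, topic_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 1, 'public post'), (2, 2, 'private post');
""")

user_input = "1 OR 1=1"  # an attacker trying to see every topic

# BAD: string concatenation executes the attacker's SQL and leaks both rows.
bad = conn.execute(
    f"SELECT body FROM posts WHERE topic_id = {user_input}"
).fetchall()
print(len(bad))   # 2 -- the private post leaks

# GOOD: a bound parameter is treated as data; '1 OR 1=1' is compared as a
# literal value and matches nothing.
good = conn.execute(
    "SELECT body FROM posts WHERE topic_id = ?", (user_input,)
).fetchall()
print(len(good))  # 0
```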

Bill Karwin
  • Don't forget sharding and partitioning too - which don't even have to be distributed partitions: even local-partitioning boosts performance by aiding the query-planner in reducing the size of lock ranges and different partitions can be stored on physically different storage for greater concurrent throughput. – Dai Mar 04 '23 at 16:48
  • @Dai Yes, sharding is related to the point I mentioned about using multiple servers. Partitioning has limited uses, in practice. Almost every time I've seen people consider it, they end up using indexing more effectively. The [unique key restriction of partitioning](https://dev.mysql.com/doc/refman/en/partitioning-limitations-partitioning-keys-unique-keys.html) prevents its use in most cases where people consider using it. – Bill Karwin Mar 04 '23 at 17:00
  • @dan You might also like my own book, [SQL Antipatterns: Avoiding the Pitfalls of Database Programming, Volume 1](https://pragprog.com/titles/bksap1/sql-antipatterns-volume-1/) – Bill Karwin Mar 04 '23 at 17:20