0

I have to design a database schema for an application I'm building. I will be using MySQL. In this application, users enter data and it gets saved in the database obviously. However, this data is not accessible to the public until the user publishes the data. Currently, I have one column for storing all the data. I was wondering if a boolean field in this table that indicates whether the data has been published is a good idea. Or, is it much better design to create one table for saved data and one table for published data and move the saved data to the published data table when the user presses Publish.

What are the advantages and disadvantages of using each one and is one of them considered better design than the other?

Yang K
  • 407
  • 4
  • 13
  • Sounds like published or not is just another attribute of the entity your users enter. So it should go in one table with a column for that attribute. If you want to conveniently query only published things you could define a view that selects only the published rows from the table. – sticky bit Jan 05 '19 at 04:52
  • @stickybit thanks for your reply. do you think there might be significant query performance boosts with either of the methods? Is one faster than the other when you're trying to search only for data that has been published? – Yang K Jan 05 '19 at 04:54
  • With scalability in mind, I think the second approach of having two tables may be costly (space-wise). So I'm in favour with your first approach. *Although*, come to think of it, if you want to save previous published data (in case your user wants to revert), it might be better to do so with your second approach. – TrebledJ Jan 05 '19 at 04:56
  • 1
    @YangK: I wouldn't worry about that. First have a robust and clean model to start with. Then for performance you should first an foremost use appropriate indexes. If that doen't help you can think about table partitioning preferably with the means the DBMS provides. If that doesn't help too, you can think about rolling your own -- with great care. But I suppose it's a long way till there. – sticky bit Jan 05 '19 at 04:58
  • 1
    If you want versioning in there do it by saving a version number to each row, not by copying/moving records around from/to different tables. That way you're far more flexible (no need to create a new table for each currently highest version). – sticky bit Jan 05 '19 at 05:03
  • @stickybit https://stackoverflow.com/questions/323065/how-to-version-control-a-record-in-a-database says otherwise. I've heard that most people create a new table for history/versioning. are you saying I shouldn't do that – Yang K Jan 05 '19 at 05:50
  • Hmm, I see a subtle difference in the purpose I had in mind and the purpose the other thread seems to deal with. I was thinking you wanted to provide version control to the users. Like the revisions here for instance, when you edit a post which is a post no matter the version. The other thread, as I get it, is addressing more the audit point of view, like in "have a log" for administrative purposes. That log may have a different structure than the original table(s) it is logging, what may require another table. – sticky bit Jan 05 '19 at 06:09

1 Answers1

0

Case: Binary

They are about equal. Use this as a learning exercise -- Implement it one way; watch it for a while, then switch to the other way.

  • (same) Space: Since a row exists exactly once, neither option is 'better'.
  • (favor 1 table) When "publishing" it takes a transaction to atomically delete from one table and insert into the other.
  • (favor 2 tables) Certain SELECTs will spend time filtering out records with the other value for published. (This applies to deleted, embargoed, approved, and a host of other possible boolean flags.)

Case: Revision history

If there are many revisions of a record, then two tables, Current data and History, is better. That is because the 'important' queries involve fetching the only Current data.

(PARTITIONs are unlikely to help in either case.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Is the solution you provided for `Case: Revision history` still the best approach if you're providing users with version control for their data rather than audit logging. I am not doing this for audit trails. I am doing this for user experience. Is your approach still the best approach when this is the case? – Yang K Jan 06 '19 at 06:12
  • @YangK - Please explain the implementation difference between Revision history and Audit trails. – Rick James Jan 06 '19 at 06:30
  • Thanks for the quick reply, mate! For revision history, the history of the data is part of the business logic. For example, users can go in and restore previous versions of the data and look at the difference between the two versions of the data. For audit trails, the history of the data is only kept for admin purposes to see which records changed at which time, etc. It is not part of the logic, its more of a change log for the data scenario. In my use case, the history of the data will be part of the business logic and isn't simply for admin purposes like it is in an audit trail. Makes sense? – Yang K Jan 06 '19 at 06:34
  • @YangK - What threw me was that your Question sounded like there would be only one old copy of any particular record. Hence, my analogy to a `deleted` flag. Summary: If `Current` is "small" compared to `History`, use 2 tables; else us a flag in one table. If necessary, a `parent_id` column could chain rows together within a single table. – Rick James Jan 06 '19 at 16:49