
Assume a very large database. A table with 900 million records.

Method A:
Table: Posts

+----------+---------------+------------------+----------------+
| id (int) | item_id (int) | post_type (ENUM) | Content (TEXT) |
+----------+---------------+------------------+----------------+
|    1     |      1        |       user       |  some text ... |
+----------+---------------+------------------+----------------+
|    2     |      1        |       page       |  some text ... |
+----------+---------------+------------------+----------------+
|    3     |      1        |       group      |  some text ... |
+----------+---------------+------------------+----------------+

// row 1 : User with ID 1 has a post with ID #1
// row 2 : Page with ID 1 has a post with ID #2
// row 3 : Group with ID 1 has a post with ID #3

The goal is to display 20 records from all 3 post_types on one page.

SELECT * FROM posts LIMIT 20

But I am worried about the number of records with this method.
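
A side note on the query above: without an ORDER BY, MySQL does not guarantee which 20 rows come back. A minimal sketch of deterministic, index-friendly paging, assuming id is the primary key and "newest first" is the desired order:

-- First page (newest 20 rows by primary key):
SELECT * FROM posts ORDER BY id DESC LIMIT 20;

-- Next page: continue from the smallest id of the previous page instead of using
-- OFFSET, which would scan and discard rows on a 900-million-row table:
SELECT * FROM posts WHERE id < 123456 ORDER BY id DESC LIMIT 20;  -- 123456 is a placeholder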

Method B:
Split the 900 million records into 3 tables, with roughly 300 million in each.

Table: User Posts

+----------+---------------+----------------+
| id (int) | user_id (int) | Content (TEXT) |
+----------+---------------+----------------+
|    1     |      1        |  some text ... |
+----------+---------------+----------------+
|    2     |      2        |  some text ... |
+----------+---------------+----------------+
|    3     |      3        |  some text ... |
+----------+---------------+----------------+

Table: Page Posts

+----------+---------------+----------------+
| id (int) | page_id (int) | Content (TEXT) |
+----------+---------------+----------------+
|    1     |      1        |  some text ... |
+----------+---------------+----------------+
|    2     |      2        |  some text ... |
+----------+---------------+----------------+
|    3     |      3        |  some text ... |
+----------+---------------+----------------+

Table: Group Posts

+----------+----------------+----------------+
| id (int) | group_id (int) | Content (TEXT) |
+----------+----------------+----------------+
|    1     |      1         |  some text ... |
+----------+----------------+----------------+
|    2     |      2         |  some text ... |
+----------+----------------+----------------+
|    3     |      3         |  some text ... |
+----------+----------------+----------------+

Now, to get a list of 20 posts to display:

SELECT * FROM User_Posts LIMIT 10
SELECT * FROM Page_Posts LIMIT 10
SELECT * FROM Group_Posts LIMIT 10

// then merge the results into an array or object and display them in the output.

In this method, I would have to sort the results in an array in PHP and then send them to the page.
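
If Method B were used anyway, the merging and sorting could be pushed into MySQL with a UNION ALL instead of being done in PHP. A minimal sketch, assuming the three tables exist as shown and that id defines the ordering (in reality a shared timestamp column would be more appropriate, since the three tables have independent auto-increment ids):

(SELECT id, user_id  AS item_id, 'user'  AS post_type, Content
   FROM User_Posts  ORDER BY id DESC LIMIT 20)
UNION ALL
(SELECT id, page_id  AS item_id, 'page'  AS post_type, Content
   FROM Page_Posts  ORDER BY id DESC LIMIT 20)
UNION ALL
(SELECT id, group_id AS item_id, 'group' AS post_type, Content
   FROM Group_Posts ORDER BY id DESC LIMIT 20)
ORDER BY id DESC
LIMIT 20;

Each branch takes its newest 20 rows via the primary key, so the outer sort only has to order at most 60 rows.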

Which method is preferred?
Will splitting a 900-million-record table into three tables affect the speed of reading and writing in MySQL?

  • Can't you test this (maybe with fewer entries but enough to make the execution time significant if it even is)? – kero Jan 07 '14 at 15:59
  • This is a hard question. How is the data being populated? What other queries are running on it? Have you considered partitioning? What indexes are on the tables? In general, the table layouts should reflect the logical structure of the data. Are these three different entities in your data? – Gordon Linoff Jan 07 '14 at 15:59
  • 3 different tables if they will be used in different places or contexts; 1 table if you don't care about showing posts filtered by type – Melon Jan 07 '14 at 15:59
  • Also depends on if you want to JOIN those tables or not. – Alp Jan 07 '14 at 16:00
  • Why do some people have this fixation that a few hundred thousand records in a single table is bad, and that lots of tables with only a few records in each is good? A well-indexed table isn't going to cause problems with a few million records, especially if you use sharding – Mark Baker Jan 07 '14 at 16:01
  • You can also investigate [partitioning](http://dev.mysql.com/doc/refman/5.1/en/partitioning.html) the table, where that enum field would be a natural way of splitting things up. Effectively you'd get 3 separate tables, but still be able to treat them all as a single one. – Marc B Jan 07 '14 at 16:01
  • Of course, in some cases a JOIN is necessary! – Pars Jan 07 '14 at 16:01
  • "The goal is displaying 20 records from all 3 post_types in a page". - you answered your own question, one table rather than 3 queries or a union query – AdrianBR Jan 07 '14 at 16:02
  • You probably shouldn't be splitting up the data in the table...as others have pointed out, look into better indexing, partitioning, etc...performance shouldn't really be a huge issue, provided you're careful with indexing. – user2366842 Jan 07 '14 at 16:03
  • So in this table, the post id would be the primary key. But in some cases I would need to select all records with item_id #10; how would indexing help me with that? – Pars Jan 07 '14 at 16:06
  • You index the fields that frequently get searched against (generally speaking...). Being that the table is very large, you'll probably need to experiment some to find out what works best in your case; I'd suggest a possible index on the item_id field. Do note, though, that adding an index will slow down data insertion some (it won't be too noticeable with one index, most likely) – user2366842 Jan 07 '14 at 16:11
  • Only optimize when you encounter a specific issue. Don't split the table; work with one table, and if it gets slower (assuming your queries are efficient), then use partitioning. 'Premature optimization is the root of all evil.' – Sam Jan 07 '14 at 16:26
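
Following up on the indexing suggestion in the last comments, a minimal sketch of secondary indexes that would serve the "all records with item_id #10" case in Method A (column names follow the table in the question; the index names are placeholders, and the exact choice should be verified against the real query mix):

-- Hypothetical secondary indexes for Method A:
ALTER TABLE posts ADD INDEX idx_item (item_id);                  -- WHERE item_id = 10, across all types
ALTER TABLE posts ADD INDEX idx_type_item (post_type, item_id);  -- WHERE post_type = 'user' AND item_id = 10

-- With one of these in place, the lookup below becomes an index range read
-- instead of a full scan over 900 million rows:
SELECT * FROM posts WHERE item_id = 10;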

1 Answer


This is actually a discussion of Single Table Inheritance vs. Table-Per-Class Inheritance, leaving out joined inheritance. The former corresponds to your Method A, the second to your Method B, and a Method C would be to keep the IDs of all posts in one table and defer the attributes specific to group, page or user posts to different tables. While a very big table always has its negative impacts, related to full table scans, the approach of splitting tables has its own, too. It depends on how often your application needs to access the whole list of posts versus retrieving only certain post types.

Another consideration you should take into account is data partitioning, which is available in MySQL or Oracle Database, for example. It is a way of organizing your data within tables that creates opportunities for information lifecycle management (which data is accessed when and how often, and whether part of it can be moved and compressed, reducing database size and speeding up access to the remaining data in the table). It basically comes in three major techniques: range-based partitioning, list-based partitioning and hash-based partitioning. Other, less commonly supported features for keeping table sizes down are inserts with a timestamp that automatically invalidate the inserted data after a certain period has expired.

What is a major application design decision, and can really boost performance, is distinguishing between read and write accesses to the database at the application level. Consider a MySQL backend: because write accesses are obviously more critical to database performance than read accesses, you could set up one MySQL instance for writes and a replica of it for reads. This is also debatable, mainly when it comes to real-time decisions, where absolute consistency of the data at any given time is a must.

Using an object pool as a layer between your application and the database is also a technique to improve application performance, though I don't know of existing solutions in the PHP world yet. Oracle Hot Cache is a pretty sophisticated example of it. You could build your own on top of an in-memory database or using memcache, though.
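
To make the partitioning option concrete, here is a minimal sketch of list partitioning applied to the Method A table. It is an illustration only: the partition names are made up, post_type is modeled as a TINYINT code (1 = user, 2 = page, 3 = group) rather than the ENUM from the question because LIST partitioning expects an integer expression, and MySQL requires the partitioning column to be part of every unique key, hence the composite primary key.

-- Hypothetical list-partitioned variant of the Method A table:
CREATE TABLE posts (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    item_id   INT UNSIGNED NOT NULL,
    post_type TINYINT UNSIGNED NOT NULL,   -- 1 = user, 2 = page, 3 = group
    Content   TEXT,
    PRIMARY KEY (id, post_type),
    KEY idx_item (item_id)
)
PARTITION BY LIST (post_type) (
    PARTITION p_user  VALUES IN (1),
    PARTITION p_page  VALUES IN (2),
    PARTITION p_group VALUES IN (3)
);

-- A query that filters on post_type is pruned to a single partition:
SELECT * FROM posts WHERE post_type = 1 AND item_id = 10;

With partition pruning, a query restricted to one post_type only touches roughly a third of the rows, which gives much of the isolation of Method B while still allowing a single query across all post types.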

Peter
  • I am using Redis to cache some data. I am somewhat scared by the size of the table; it's really a big table! – Pars Jan 07 '14 at 16:23
  • Haha, I love Redis but am still waiting for the official announcement of Redis Cluster being production-ready. But Redis heavily supports partitioning techniques, and twemproxy (the one used by Twitter for clustering Redis, afaik) is said to be somewhat automatic about partitioning. – Peter Jan 07 '14 at 16:30
  • Don't be scared by the size of the table until it becomes too big for your disk space. Databases like Postgres and Oracle and MySQL are designed to work efficiently with large volumes of data (billions of records) as long as they are well indexed; and when size does eventually become problematic for _administering_ the table, they provide mechanisms such as sharding to help at that stage. – Mark Baker Jan 07 '14 at 16:47
  • I don't entirely agree with this comment, because databases may lock tables temporarily on read and/or write accesses, and applications or CMSs such as Drupal do so too, so avoiding unnecessarily big tables is good practice to me. – Peter Jan 07 '14 at 16:50
  • As an example: I have three production sites running Drupal with a MySQL database backend. Due to somewhat open comment settings, one of the articles published there has about 65xxx comments, all stored in one table. Every request to this article is a full scan over the comment table, as the comments are sorted by date. This temporarily pushes the memory consumption of my MySQL instances to about 20%. For a while, the Apache HTTP daemon was misconfigured so that the maximum number of clients per server was too low, causing segmentation faults. – Peter Jan 07 '14 at 16:57
  • Within this scenario the locks taken by Drupal were increasingly high. So I strongly recommend taking into account exactly what you are doing at every level of your architecture and infrastructure; the sum of several not-so-bad things can be stunningly bad. – Peter Jan 07 '14 at 16:58
  • That's a case for a DB engine that uses row locking rather than table locking, and/or for a read-only replica table for read-only access – Mark Baker Jan 07 '14 at 17:07