2

Every month I get sent a file from a external company which needs to be stored in a database, each file containing up to a million records. The main data fields are Month, Year, Postcode and TransactionType.

I was proposing that we should save the data in our database as a new SQL table each month so we know there is only a finite amount of data in each table. However one of my collegues said he was once told that to create a new table every month is bad practice, but he didn't know why.

If I was to have multiple tables, there would only be a maximum of 60 tables, though there may be far fewer (down to 12) dependent on how far into the past my client needs to look. This means that every month I will need to delete a month's worth of data.

However when I do my SQL queries I will only need a single row of data from a single table per query. I would think in theory this would be more efficient than having a single table filled with millions of rows.

I was wondering if anyone had any definitive reasons as to why splitting the data this way would be a bad thing to do?

RyuAkamatsu
  • 229
  • 5
  • 15
  • Running select queries is much simpler if it's only one table. – Dan Bracuk Nov 05 '15 at 17:09
  • This question will probably be closed as a duplicate soon. If the answers from the duplicates (such as http://stackoverflow.com/a/21283725/3854195) don't fully address your question please edit it to include why and flag this for re-opening. Thanks! – Morpheus Nov 05 '15 at 17:20

3 Answers3

3

All "like" items should be stored together in a database for the following reasons:

  • You should be able to provide any subset of the items using a single SELECT statement only by changing the WHERE clause of that statement. With separate tables you will have to write code to decompose the request into the parts that compute the table name and the parts that filter that table. And you will have to duplicate that logic in each application, or teach it to each user, that wants to use your database.

  • You should not artificially limit the use to which your data can be put. If you have separate monthly tables you have already substantially limited the types of queries you can enter against them without having to write more complex UNION queries.

  • The addition of more instances of a known data type to your database should not require ALTERing the structure of your database and, as a general principal, regularly-run code should not even have ALTER permissions

  • If proper indexes are maintained, there is very little performance difference when SELECTing data from a table 60 times the size of a smaller table. (There can be more effect on INSERT and UPDATE commands but it sound like you'll be doing a bulk update rather than updating the data constantly).

I can think of only two reasons for sharding data into separate tables:

  • You discover that you have a performance issue that can't be resolved through better data design.

  • You have records with different level of security and are relying on GRANT SELECT permissions to allow some users to see the records at higher levels of security.

Larry Lustig
  • 49,320
  • 14
  • 110
  • 160
  • There are actually going to be permission groups but they are designated on postcode. Should that be a factor in the database table design? – RyuAkamatsu Nov 05 '15 at 18:46
  • Depends on the level of security you need and the nature of your application(s). Absolute security can only be guaranteed by partitioning the data according to security level and then granting access appropriately. [You may be able to create `VIEW`s to hide the data and `GRANT` on those; I don't know if that's available in MySQL]. Otherwise you're reliant on application-level logic to hide the data from the users. – Larry Lustig Nov 05 '15 at 19:00
0

A simpler method would be to add a column to that table which contains a datetimestamp of when that was loaded into the system. That way you can filter by that perticular column to segregate that data into the months/years that it was loaded in.

Another advantage from a performance perspective, that if you regularly filter data this way, you can create an index based on this date column.

Having multiple tables that contain the same information is not recommended for performance reasons and how information is stored in SQL. Eventually it will take up more space and if one month's data needs to reference another month's data it will be quite slow.

Hope this helps.

Herman
  • 300
  • 2
  • 9
0

If you think it isn't difficult for you to manage your application, you can do it.

Example. Do you need to change SQL queries every month? If user need more report that need data more than 1 month, What happen?

Using partitioning, DBMS will split your data to multiple table on the physical storage but You can call all of them by the same name. DBMS will analyse with partition it should take. Performance isn't different significantly.