
I'm building a program that stores news headlines for companies, along with their timestamps, from various sources.

Let's say the number of companies is 1000: Apple, Google, Microsoft, etc.

So I can think of two options.

  1. One table with numerous rows (the code below is just an example).

    CREATE TABLE news
    (
        news_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        company VARCHAR(10) NOT NULL,
        timestamp TIMESTAMP NOT NULL,
        source TEXT NOT NULL,
        content TEXT NOT NULL,
        ...
    )
    
    -- I could also make (company, timestamp) the primary key,
    -- with news_id as a unique key.
    
  2. 1000 Tables

    CREATE TABLE news_apple -- and news_google, news_microsoft, news_... (x 1000)
    (
        news_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        timestamp TIMESTAMP NOT NULL,
        source TEXT NOT NULL,
        content TEXT NOT NULL,
        ...
    )
    

Most of the time, I will be looking up the news for a certain company. Let's say there are more than 10,000 news items for each company. I wonder whether using a WHERE clause in the first option would be slower than the second option.
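
To make the comparison concrete, the lookup I have in mind would be one of these two query shapes (using the schemas above):

    -- Option 1: one table, filtered with a WHERE clause
    SELECT * FROM news WHERE company = 'Apple' ORDER BY timestamp;

    -- Option 2: one table per company, no filter needed
    SELECT * FROM news_apple ORDER BY timestamp;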

Which option is more efficient, and why?

KimchiMan
  • More rows. That is what relational databases are designed for. Multiple tables with the same structure are usually a sign of a bad design. – Gordon Linoff Jan 22 '14 at 03:38
  • Have you considered a nosql-based solution? – theMayer Jan 22 '14 at 03:48
  • @GordonLinoff Can you give a more detailed explanation? – KimchiMan Jan 22 '14 at 03:48
  • @rmayer06 Yes, but it's not an option. – KimchiMan Jan 22 '14 at 03:48
  • For example, think "What solution would make it easier to query all news from a given source /all news created after some timestamp". – Can't Tell Jan 22 '14 at 04:55
  • For once, I completely agree with @GordonLinoff on a style matter :). It's about 1000 times easier to add more rows than it is to add more columns/tables. And queries quickly become nightmares trying to "meld" similar tables back into a main table as if they were the same table. Add column(s) to differentiate types of rows. – Bohemian Jan 22 '14 at 04:58

4 Answers


Relational databases are designed to store many rows per table. There are a whole bunch of mechanisms to facilitate large tables, such as:

  • Indexes on any combination of fields to speed searches (see the sketch after this list)
  • Page caching so commonly used pages remain in memory
  • Vertical partitioning (columnar databases) to further speed requests
  • Advanced algorithms such as hash joins and group bys (at least in databases other than MySQL)
  • Use of multiple processors and disks to process queries
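
For instance, a composite index matching the common lookup (a sketch against the single-table schema from the question; the index name is arbitrary):

    CREATE INDEX idx_company_timestamp ON news (company, timestamp);

    -- This lets MySQL read only the matching index range
    -- instead of scanning the whole table:
    SELECT news_id, source, content
    FROM news
    WHERE company = 'Apple'
    ORDER BY timestamp DESC;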

There is one thing that is more difficult when putting data in a single table, and that is security. In fact, in some circumstances security is the primary concern and essentially requires that the data go into separate tables. Those applications are few and far between.

To give an example of how bad storing data in multiple tables can be, imagine that your system has one record per company, stored in a table. This record holds information about the company -- something like name, address, whatever. Call it 100 bytes of information.

In your schema there is a separate table for each "company", so that is one row per table. Each record will reside on its own data page. A data page might be 16 KB, so you waste roughly 15.9 KB per company. Storing 1000 such records then occupies 16 MB instead of about 7 pages' worth (112 KB). That can be a significant performance hit.

In addition, with multiple tables you are not taking into account the challenges of maintaining all the tables and ensuring the correctness of data across them. Maintenance updates need to be applied to a thousand tables instead of a handful.
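
To see the difference, consider adding a column (a sketch; the author column is hypothetical):

    -- One table: a single statement updates the schema everywhere.
    ALTER TABLE news ADD COLUMN author VARCHAR(100) NULL;

    -- 1000 tables: you must generate and run one statement per table,
    -- e.g. by building the statements from information_schema:
    SELECT CONCAT('ALTER TABLE `', table_name, '` ADD COLUMN author VARCHAR(100) NULL;')
    FROM information_schema.tables
    WHERE table_schema = DATABASE()
      AND table_name LIKE 'news\_%';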

Gordon Linoff

My experience with many tables -- I currently work for a company that has 6000+ tables and growing -- is the difficulty of managing them all. The upside is that you reduce the number of rows in each table (I have tables with 30 million rows at other companies), which is great -- but only if you have a manageable solution for all your tables.

For example, the 6000+ tables are actually hotels with their room-number IDs (data_room#_hotelID), so I can write stored procedures, bash, or PHP scripts to handle them.

On the flip side, many rows is simpler because you have fewer tables to deal with, but then you need to get creative about how to pull the data out. You will definitely want an index on the columns in your WHERE clauses, but that's obvious.
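
A quick way to confirm an index is actually used is EXPLAIN (a sketch against the question's single-table schema, assuming an index on company exists):

    EXPLAIN SELECT news_id, timestamp
    FROM news
    WHERE company = 'Apple';
    -- The "key" column of the output should name the index rather than showing NULL.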

Options I've used in the past to manage large tables are:

1) Insert the data in auto-increment order in a way that makes sense (like alphabetically) -- but since your data changes a lot, it's not a good option.

2) Loading tables into memory -- I don't know where you are hosting this, but I use MySQL in memory. I only know of two ways to do this: one is to set it up in MySQL itself, and the other is to create a memory drive (using some of the machine's RAM) and move MySQL and its data onto the mounted drive (I've also used flash drives).
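
For the first way, MySQL's built-in MEMORY storage engine keeps a table entirely in RAM. Note that MEMORY tables cannot hold TEXT/BLOB columns, so it only suits fixed-width data -- e.g. a hypothetical hot-lookup table rather than the full news table:

    CREATE TABLE news_recent_cache
    (
        news_id INT NOT NULL PRIMARY KEY,
        company VARCHAR(10) NOT NULL,
        timestamp TIMESTAMP NOT NULL
    ) ENGINE = MEMORY;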

You're also going to want to factor in whether you are INSERTing or SELECTing data more often.

So to sum up:

  • If you choose to make many tables, be sure they aren't holding huge amounts of data and that you can manage them all.

  • If you go with fewer tables and more rows, use indexes, a RAM drive, and caching to speed up getting the data out of large tables.

That's my two cents, I hope it helps you!

Drace

Make one table with all the companies in it, but use foreign keys! You can identify each company by a number and have a separate lookup table that maps each number to the company name. Like:

    CREATE TABLE news
    (
        news_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        company_id INT NOT NULL,
        timestamp TIMESTAMP NOT NULL,
        source TEXT NOT NULL,
        content TEXT NOT NULL,
        ...
    )

and

    CREATE TABLE companies
    (
        company_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        company_name VARCHAR(10) NOT NULL
    )

Make it explicit with foreign key constraints: http://dev.mysql.com/doc/refman/5.6/en/create-table-foreign-keys.html
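
For example (a sketch; the constraint name is arbitrary, and both tables need a storage engine that enforces foreign keys, such as InnoDB):

    ALTER TABLE news
        ADD CONSTRAINT fk_news_company
        FOREIGN KEY (company_id) REFERENCES companies (company_id);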

This way you will save space by storing numbers repeatedly instead of strings, your indexes will be smaller, and your lookups will be faster.

Andrew Ehrlich
  • Thanks for replying. I know the tables can be separated using foreign keys, but I want to know the performance of the two specific options above. – KimchiMan Jan 22 '14 at 03:58
  • It depends on the type of queries you want to do. If you only ever want to query records from a particular news source then splitting them into multiple tables will perform better. If however you want to query all news sources for particular content, then having one table that is well indexed will perform better as then you don't have to query multiple tables and aggregate the results. – Dijkgraaf Jan 22 '14 at 05:13

I generally go with the approach of fewer tables for storage, but many tables for queries.

The power of an RDBMS is having queries find the data you need as efficiently as possible, considering you don't have the entire dataset in RAM.

One way to think about it is how WinAmp and macOS Mail work. They do not store the actual data of the songs or emails; they go through them, extract the metadata that people would query against, and store that in their database.

This separation allows you to redesign your schema, provided that the information is extractable from the data.

However, because of efficiencies gained by databases over time, LOB stores are a good enough place to put the primary source of truth, which avoids backing up two different stores. This makes your database the single canonical source of truth for the state of the system.

Note that it does not mean you should serve the data from the database. I would still keep a pointer to an S3 copy (or, better, derive it from the IDs), and then serve the content from S3 to the user.
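
A hypothetical sketch of that pointer idea -- the database row carries the queryable metadata plus a key into the object store, and the content itself is served from S3 (table and column names are illustrative, not from the question):

    CREATE TABLE news_meta
    (
        news_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        company_id INT NOT NULL,
        timestamp TIMESTAMP NOT NULL,
        headline VARCHAR(255) NOT NULL,
        s3_key VARCHAR(255) NOT NULL -- e.g. 'news/<company_id>/<news_id>.html', derivable from the ids
    )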

Archimedes Trajano