This may have been asked before, but here's the situation anyway.
I have one big table (MySQL with InnoDB) that is basically a huge log; no fancy relational stuff.
Three fields: Customer_ID, a timestamp, and a data field (TINYTEXT, with values like 'Visited Front Webpage' or 'Logged In').
Since I'm logging the activity of clients on a site that receives around 10,000 users a day, the table grows pretty fast.
At any given moment I want to know how many distinct clients have actually done anything on the site.
So I run 'SELECT DISTINCT Customer_ID FROM table;', and I've noticed that as the table grows the query takes longer, which is perfectly fine and totally expected. At some point, though, it started taking more than 5 minutes to complete.
I wanted to find a faster way, so I tried this. Let's say I'm working with a table of 1 million rows. I split that table into 10 tables of 100K rows each, ran 'SELECT DISTINCT Customer_ID FROM table;' against each one, and then merged all the results through 'sort | uniq | wc' on the command line, arriving at the same count (roughly what the sketch below does).
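Here's a minimal sketch of that merge step, assuming the chunks are named mytable_0 through mytable_9 (the table names and client flags are illustrative, not my exact script):

for i in $(seq 0 9); do
  # -N skips column headers, -B gives plain tab-separated batch output
  mysql -N -B -e "SELECT DISTINCT Customer_ID FROM mytable_$i;" mydb
done | sort | uniq | wc -l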
Surprisingly, that method took less than half the time of the single big query.
I've pretty much answered my own question: ten 100K-row tables are faster than one 1M-row table. BUT maybe I'm doing something wrong, or maybe it's really a performance-tuning issue, because a table should be designed to perform well no matter its size.
Let me know what you think.
Thanks for reading.
UPDATE: Here's how I create my table:
CREATE TABLE `mydb`.`mytable` (
  `Customer_ID` BIGINT(20) UNSIGNED NOT NULL,
  `unix_time` INT(10) UNSIGNED NOT NULL,
  `data` TINYTEXT NOT NULL,
  -- secondary index on Customer_ID (no explicit primary key)
  KEY `Customer_ID` (`Customer_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
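For what it's worth, the 100K-row chunks for the test above were built with something along these lines (a rough sketch; the chunk name mytable_0 and the LIMIT/OFFSET values are illustrative, and without an ORDER BY the row order isn't guaranteed):

CREATE TABLE `mydb`.`mytable_0` LIKE `mydb`.`mytable`;
-- copy one 100K-row slice; repeat with a different OFFSET for each chunk
INSERT INTO `mydb`.`mytable_0` SELECT * FROM `mydb`.`mytable` LIMIT 100000 OFFSET 0;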