We are running a custom OpenX ad server on a MySQL database that receives approximately 1 million clicks per day. We need to store all of this click information and show statistics based on it.

Right now, all the click information is aggregated every 2 days and the individual click records are deleted. But we want to provide our affiliates with a new feature that will allow them to set a dynamic tracking ID (TID) and, basically, track their clicks and conversions based on it.

So the problem is that our clicks table will grow by at least 1 million entries a day, and we need to be able to search this table and show all the clicks for one user for a specific period of time, grouped by the TID mentioned above, or search by TID directly.

I had a look at MySQL partitioning and it seems like a good solution, but I'm not sure it will still work well on a HUGE database (maybe billions of entries).

What do you think would be the correct approach for this issue?

EDIT:

Based on your answers, I'm now thinking of a mixed solution.

We already have a "LIVE" table, from which entries are deleted once the clicks have been aggregated at maintenance time. It looks something like this:

Table: clicks

viewer_id | ... | date_time | affiliate_id | ... | tid

(I skipped the columns which are unimportant at this point)
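For reference, here is a minimal sketch of the live table; only the columns listed above come from our actual schema, while the types and the click_id surrogate key are assumptions:

```sql
-- Hypothetical sketch of the live clicks table. Column types and
-- the click_id surrogate key are assumed, not the real schema.
CREATE TABLE clicks (
    click_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    viewer_id    INT UNSIGNED    NOT NULL,
    date_time    DATETIME        NOT NULL,
    affiliate_id INT UNSIGNED    NOT NULL,
    tid          VARCHAR(64)     NOT NULL,
    PRIMARY KEY (click_id)
) ENGINE=InnoDB;
```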

At maintenance time, I can move everything to another monthly table which looks almost the same, say Table: clicks_2012_11, which has indexes on date_time, affiliate_id and tid and is partitioned by affiliate_id.
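Something like this (a sketch; the partition count, column types and date boundaries are assumptions):

```sql
-- Sketch of the monthly archive table, partitioned by affiliate_id.
-- There are no unique keys, so MySQL's rule that every unique key
-- must include the partitioning column is not an issue here.
CREATE TABLE clicks_2012_11 (
    click_id     BIGINT UNSIGNED NOT NULL,
    viewer_id    INT UNSIGNED    NOT NULL,
    date_time    DATETIME        NOT NULL,
    affiliate_id INT UNSIGNED    NOT NULL,
    tid          VARCHAR(64)     NOT NULL,
    KEY idx_date_time (date_time),
    KEY idx_affiliate (affiliate_id),
    KEY idx_tid (tid)
) ENGINE=InnoDB
PARTITION BY HASH(affiliate_id) PARTITIONS 16;

-- Maintenance-time move of November's rows out of the live table:
INSERT INTO clicks_2012_11
    SELECT click_id, viewer_id, date_time, affiliate_id, tid
    FROM clicks
    WHERE date_time >= '2012-11-01' AND date_time < '2012-12-01';
DELETE FROM clicks
    WHERE date_time >= '2012-11-01' AND date_time < '2012-12-01';
```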

So now, when an affiliate wants to see his statistics for the past 2 months, I know I have to look inside Table: clicks_2012_10 and Table: clicks_2012_11 (I will limit the time range to a maximum of 2 months). Because the tables are partitioned by affiliate_id, only the needed partition will be searched in each of the 2 tables, and I can then list all the TIDs which had any activity in the past 2 months.
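The query could look something like this (a sketch; affiliate ID 42 is just an example value):

```sql
-- Per-TID click counts for one affiliate across two monthly tables.
-- With HASH(affiliate_id) partitioning, the equality predicate lets
-- MySQL prune to a single partition in each table.
SELECT tid, COUNT(*) AS clicks
FROM (
    SELECT tid FROM clicks_2012_10 WHERE affiliate_id = 42
    UNION ALL
    SELECT tid FROM clicks_2012_11 WHERE affiliate_id = 42
) AS both_months
GROUP BY tid;
```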

What do you think about this approach? Are there any obvious issues? Am I overcomplicating things without a solid reason?


2 Answers

There is nothing inherent in big (even "huge") tables that makes MySQL fail. Big tables are mostly a problem in terms of:

  • disk space
  • cache usage (the working set will likely no longer fit in memory)
  • maintenance (schema changes, rebuilds, ...)

You need to address all of these.

Partitioning is mainly useful for bulk data maintenance, such as dropping entire partitions. It is certainly not best practice to partition big tables by default on just some column; partitioning should always be introduced for a specific reason.
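To illustrate the maintenance case (a sketch; the table and partition names are hypothetical, not from the question):

```sql
-- RANGE partitioning by month makes expiring old data a metadata
-- operation instead of a huge DELETE.
CREATE TABLE clicks_archive (
    viewer_id    INT UNSIGNED NOT NULL,
    date_time    DATETIME     NOT NULL,
    affiliate_id INT UNSIGNED NOT NULL,
    tid          VARCHAR(64)  NOT NULL
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(date_time)) (
    PARTITION p2012_10 VALUES LESS THAN (TO_DAYS('2012-11-01')),
    PARTITION p2012_11 VALUES LESS THAN (TO_DAYS('2012-12-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Dropping a whole month is near-instant:
ALTER TABLE clicks_archive DROP PARTITION p2012_10;
```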

usr
  • Thanks for your input. I'm thinking about partitioning the table by the affiliate_id since this affiliate_id will be present in all the WHERE clauses for all the queries. When I try to get all the stats for the last 2 months for a particular affiliate id, wouldn't that help in speeding up the queries? What would be the downside of this approach? – user1782560 Oct 29 '12 at 18:03
  • You don't need partitioning for that. Cluster the table on `affiliate_id, date_time desc`. – usr Oct 29 '12 at 23:00
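In InnoDB terms, "cluster the table" means choosing the primary key, since InnoDB stores rows in primary-key order. Note also that MySQL before 8.0 parses DESC in key definitions but ignores it, serving descending reads via reverse index scans. A sketch of that suggestion (the click_id tie-breaker column is an assumption):

```sql
-- Clustering the live table on (affiliate_id, date_time): InnoDB
-- stores rows in PRIMARY KEY order, so one affiliate's clicks over a
-- date range become a contiguous range scan. ORDER BY date_time DESC
-- is served by scanning the index backwards.
CREATE TABLE clicks (
    viewer_id    INT UNSIGNED    NOT NULL,
    date_time    DATETIME        NOT NULL,
    affiliate_id INT UNSIGNED    NOT NULL,
    tid          VARCHAR(64)     NOT NULL,
    click_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (affiliate_id, date_time, click_id),
    KEY idx_click_id (click_id)  -- AUTO_INCREMENT must lead some key
) ENGINE=InnoDB;
```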

Optimizing for insertion and optimizing for retrieval are usually mutually exclusive. You might be better off with two tables:

  • live data: no (or minimal) keys, MyISAM to remove transaction overhead, etc.
  • historical data: indexed up the wazoo, with data moved over from the live table on a periodic basis (see the sketch below).
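A minimal sketch of that split (column names follow the question; the table names and cutoff logic are assumptions, and MyISAM trades crash safety for insert speed):

```sql
-- Live table: optimized for cheap inserts, no secondary indexes.
CREATE TABLE clicks_live (
    viewer_id    INT UNSIGNED NOT NULL,
    date_time    DATETIME     NOT NULL,
    affiliate_id INT UNSIGNED NOT NULL,
    tid          VARCHAR(64)  NOT NULL
) ENGINE=MyISAM;

-- Historical table: indexed for the reporting queries.
CREATE TABLE clicks_history (
    viewer_id    INT UNSIGNED NOT NULL,
    date_time    DATETIME     NOT NULL,
    affiliate_id INT UNSIGNED NOT NULL,
    tid          VARCHAR(64)  NOT NULL,
    KEY idx_aff_time (affiliate_id, date_time),
    KEY idx_tid (tid)
) ENGINE=InnoDB;

-- Periodic move, bounded by a cutoff so rows inserted mid-move
-- are not lost:
SET @cutoff = NOW();
INSERT INTO clicks_history
    SELECT * FROM clicks_live WHERE date_time < @cutoff;
DELETE FROM clicks_live WHERE date_time < @cutoff;
```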
Marc B