
I have a MySQL DB which gets around 150 million inserts per day, and the retention period is around 60 days.

  1. Each record is indexed on id.
  2. Every time an update arrives, the flow is:
    1. Check whether the record is present. If it is, update it with the new data.
    2. Otherwise, insert a new record.
  3. Delete records that were created more than 60 days ago. (A SQL sketch of these steps follows this list.)
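
Roughly, in SQL, these steps might look like the following; the table name events and the columns payload and created_at are assumptions for illustration, not my real schema:

    -- Step 2: check whether the record exists, then update or insert
    SELECT id FROM events WHERE id = 12345;

    -- if a row came back:
    UPDATE events SET payload = 'new data' WHERE id = 12345;
    -- otherwise:
    INSERT INTO events (id, payload, created_at) VALUES (12345, 'new data', NOW());

    -- Step 3: purge records created more than 60 days ago
    DELETE FROM events WHERE created_at < NOW() - INTERVAL 60 DAY;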

My main use case is as follows:

Run some bulk queries, e.g.:

SELECT * FROM table WHERE prop = val1 AND prop2 = val2 ...

This will return a large number of records, e.g. 1M.

Is the following approach good?

  1. Have a master DB with an index on id only, with a retention of 60 days.
  2. Have a read replica DB. This DB will be indexed on many columns.
  3. All bulk queries will be run against the read replica DB.

Is this a good solution?

EDIT: I plan to use an Amazon RDS DB and found this in their documentation:

 Q: Can my Read Replicas only accept database read operations?

Read Replicas are designed to serve read traffic. However, there may be use cases where advanced users wish to complete Data Definition Language (DDL) SQL statements against a Read Replica. Examples might include adding a database index to a Read Replica that is used for business reporting, without adding the same index to the corresponding source DB Instance. If you wish to enable operations other than reads for a given Read Replica, you will need to modify the active DB Parameter Group for the Read Replica, setting the “read_only” parameter to “0.”
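
So, for example, once read_only is 0 in the replica's parameter group, a reporting index could be added on the replica only. A rough sketch using the placeholder names from my query above:

    -- run against the read replica only, never against the source instance
    -- (backticks because TABLE is a reserved word; substitute the real table name)
    ALTER TABLE `table` ADD INDEX idx_prop_prop2 (prop, prop2);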

Sebi2020
user93796

5 Answers


To answer your question:

Is the following approach good?

  1. Have a master DB with an index on id only, with a retention of 60 days.
  2. Have a read replica DB. This DB will be indexed on many columns.
  3. All bulk queries will be run against the read replica DB.

Is this a good solution?

Updated

In my opinion and experience, No.

Technically, this solution may work, but in practice it is not suitable for production use. MySQL's built-in master-slave replication works only if the table in the slave database has the same layout as the table in the master database.

You will have approximately 9 billion records (150 million x 60 days). My estimate is that on disk this could take up to 1 TB (assuming each record is about the size of a tweet). 150 million inserts and 150 million deletes (of expired records) per day will surely fragment the indexes and slow inserts down, requiring frequent rebuilds.

Things will get progressively more complicated when you need more than one read replica, which is a natural evolution of such a system.

If you have 150 million inserts a day, you should consider a NoSQL database. MongoDB used to support InnoDB as a storage engine as well; I'm not sure if it still does.

If you wish to stick with an RDBMS like MySQL, you should use a strategy such as database sharding. With sharding, you segment your data so that the load is distributed across a cluster of MySQL instances.

A slightly less scalable option than sharding is to use a storage engine such as MyISAM. MyISAM is not fully ACID compliant but offers great performance, and it supports concurrent inserts.

Litmus

Consider using FastBit if your primary use is SELECT * with no joins and multiple filters on different columns. FastBit implements WAH-compressed bitmaps that can be evaluated very efficiently, and it stores data as a column store.

https://sdm.lbl.gov/fastbit/

For MySQL, perhaps consider TokuDB, which has 'clustered' index support, or creating covering indexes in InnoDB. This is really only effective if you filter on a small set of attribute combinations. If not, consider FastBit.
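
As a rough sketch of the covering-index idea, using the placeholder names from the question plus an assumed extra_col, and assuming id is the table's primary key: a query whose columns are all contained in one index can be answered from that index alone, which is why this only pays off when the selected columns are few:

    -- wider index that also covers an assumed frequently-read column, extra_col
    -- (backticks because TABLE is a reserved word)
    ALTER TABLE `table` ADD INDEX idx_cover (prop, prop2, extra_col);

    -- answered from the index alone (id travels with every InnoDB secondary index);
    -- SELECT * could not be, because it needs columns outside the index
    SELECT id, prop, prop2, extra_col
    FROM `table`
    WHERE prop = 'val1' AND prop2 = 'val2';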

If you always filter on the same attributes, then you can consider using Flexviews: http://flexvie.ws

You could create a view for SELECT * FROM table WHERE val1 = X AND val2 = Y

Or just roll your own version. After loading data, run:

    REPLACE INTO summary_table_v2v2
    SELECT * FROM table
    WHERE val1 = X AND val2 = Y AND table.last_update > NOW() - INTERVAL 1 DAY;

That will "refresh" the table with any changes made in the last day, assuming last_update is a timestamp column.
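
If you roll your own, one way to run that refresh automatically once a day is a MySQL event; a sketch using the placeholder names from above (it assumes the event scheduler is enabled, which on RDS is typically done through the parameter group):

    CREATE EVENT refresh_summary
    ON SCHEDULE EVERY 1 DAY
    DO
      REPLACE INTO summary_table_v2v2
      SELECT * FROM `table`    -- backticks: TABLE is a reserved word; use your real table name
      WHERE val1 = X AND val2 = Y
        AND last_update > NOW() - INTERVAL 1 DAY;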

Justin Swanhart

The answer from @eternal-learner is not correct.

Yes, the approach you outlined can be a good one. You need to take a few precautions:

  1. Ensure master-slave replication is working before making index changes.

  2. Make all index changes on the slave only, and be sure to make only index changes that can't break the logic of the data model (i.e. do not introduce new unique indexes / constraints).

  3. Ensure that the slave can't be promoted to master in a failover situation, or you will end up with a lower-performance master with different indexes than any other slave(s) in the group.

Also, be careful about how you do the update-or-insert; it's easy to create a race condition there (see the sketch below).
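
One way to avoid that race is a single-statement upsert instead of SELECT-then-INSERT; a minimal sketch, assuming id is the primary key (or has a unique index) and using made-up column names:

    -- inserts the row, or updates it in place if the id already exists
    INSERT INTO events (id, payload, created_at)
    VALUES (12345, 'new data', NOW())
    ON DUPLICATE KEY UPDATE payload = VALUES(payload);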

Chris Johnson
  • Keeping in mind the site receives 150 million inserts a day, trying to get this to work once as an academic exercise is fine, but it is not a feasible solution to use in production. When things break, recovery will be complex and time consuming (rebuilding all the extra indexes). Moreover, MySQL replication is asynchronous; with extra indexes on the slave, the syncing is bound to lag behind. – Litmus Oct 17 '13 at 14:48
  • I agree that if synchronous / realtime operation is needed, then this isn't a great solution. However if the bulk queries can be run at defined times (maybe even pause replication at the right moment in time), and if when the rare failure happens a rebuild period for the read replica is tolerable, then I think it would work fine. Those are operational tradeoffs -- whether they're important in this case or not isn't apparent in the OP. – Chris Johnson Oct 17 '13 at 15:14
  • @Eternal-Learner In my case I don't need real-time data. It is OK if the slave lags the master by 2 hours. And I don't intend to promote my slave to master in case of failure. – user93796 Oct 18 '13 at 15:22
  • @user93796 You are the best judge, since you have more data points about the use case. A back-of-the-envelope calculation tells me you will have 9 billion records (150 million x 60 days). Depending on the size of each record (assuming it is about the size of one tweet), on disk it could be close to 1 TB. That is 150 million deletes (of expired records) and 150 million inserts every day! – Litmus Oct 18 '13 at 16:30
  • @user93796, there are alternative solutions, but to answer your original question -- yes the approach you outlined can work. – Chris Johnson Oct 18 '13 at 16:33
  • @ChrisJohnson @Eternal-Learner I found some documentation regarding this on the Amazon AWS RDS website. Please check my last edit. – user93796 Oct 23 '13 at 07:14
  • Yes, technically speaking, that is what everyone on this thread has said: it is doable. But my hesitation comes from the volume of transactions you have mentioned; I just feel it is impractical for such heavy loads. The other strategies I mentioned are specifically there to address your kind of needs. A cloud service such as RDS has the advantage that you can easily discard a read replica and create a new one. If you host on your own servers, it won't be that simple. – Litmus Oct 23 '13 at 07:39
  • @Eternal-Learner, I agree those are valid concerns. – Chris Johnson Oct 23 '13 at 13:15

Clustered Indexes

Even if you use a replicated database, if your schema is not designed for large tables you will not see any improvement in performance.

I suggest you review your design after reading these links:

InnoDb Index Types

Here you can find some examples of clustered indexes, which are available only with InnoDB tables.

60 million entries, select entries from a certain month. How to optimize database?

It deals with tables of 60 to 500 million rows.
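
To make the clustered-index idea concrete (a sketch with assumed table and column names, not taken from the linked posts): InnoDB stores rows in primary-key order, so a primary key that leads with the filter columns keeps matching rows physically close together:

    CREATE TABLE events (
      prop1      VARCHAR(32) NOT NULL,
      prop2      VARCHAR(32) NOT NULL,
      id         BIGINT      NOT NULL,
      payload    TEXT,
      created_at DATETIME    NOT NULL,
      PRIMARY KEY (prop1, prop2, id)   -- the clustered index: rows are stored in this order
    ) ENGINE=InnoDB;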

Search Engine

As another alternative, you can use a search engine like Sphinx (it is open source), but your database design should then be denormalized so that your multi-column WHERE clause is transformed into a single-column WHERE clause. For example, instead of:

SELECT * FROM table WHERE prop = val1 AND prop2 = val2 AND prop3 = val3 ...

make a single indexed column (e.g. prop_key) and build the search value as val_tot = CONCAT(val1, val2, val3, ...), then:

SELECT * FROM table WHERE prop_key = val_tot;
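
A minimal sketch of that denormalization on the MySQL side, assuming MySQL 5.7+ for generated columns (on older versions the prop_key column would have to be maintained by the application or a trigger), and using the placeholder names from above:

    -- prop_key kept in sync by MySQL itself as a stored generated column
    -- (backticks because TABLE is a reserved word)
    ALTER TABLE `table`
      ADD COLUMN prop_key VARCHAR(255)
        AS (CONCAT(prop, '|', prop2, '|', prop3)) STORED,
      ADD INDEX idx_prop_key (prop_key);

    -- the multi-column filter becomes a single indexed equality lookup
    SELECT * FROM `table`
    WHERE prop_key = CONCAT('val1', '|', 'val2', '|', 'val3');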
randiel

I haven't tried it yet, but I don't think replication supports a different table structure between master and slave; I haven't found any documentation from MySQL about that. The idea is that MySQL replays the master's binary logs on the slave from time to time, so the structure should be the same to avoid conflicts.

To deal with the huge-database problem, another option is MySQL partitioning; or you can have a script that recomputes the huge data into smaller, well-indexed tables.
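
A minimal partitioning sketch for the 60-day retention case, with assumed table and column names: partition by day on the creation timestamp and drop old partitions instead of running huge DELETEs:

    CREATE TABLE events (
      id         BIGINT   NOT NULL,
      payload    TEXT,
      created_at DATETIME NOT NULL,
      PRIMARY KEY (id, created_at)   -- the partitioning column must be part of every unique key
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
      PARTITION p20131001 VALUES LESS THAN (TO_DAYS('2013-10-02')),
      PARTITION p20131002 VALUES LESS THAN (TO_DAYS('2013-10-03'))
      -- one new partition added per day by a scheduled job
    );

    -- retention: dropping a whole partition is far cheaper than deleting ~150 million rows
    ALTER TABLE events DROP PARTITION p20131001;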

Hắc Huyền Minh