
I am an Amazon RDS customer and am experiencing daily write latency spikes on Amazon RDS that correspond roughly to the backup window. I also see spikes at the end of a snapshot (case in point: a snapshot takes approximately 1 hour to run, and write latency spikes in the final 5 minutes). I am running a Multi-AZ m1.large deployment.
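One way to check how closely the spikes track the backup window is to split the high-latency samples into those inside and outside the window. This is only a minimal sketch: the helper names `in_daily_window` and `spikes_by_window`, the threshold, and the sample values are hypothetical, and it assumes the latency samples have already been exported as `(timestamp, seconds)` pairs in the same time zone as the backup window (RDS expresses the window in UTC).

```python
from datetime import datetime, time, timedelta


def in_daily_window(ts, window_start, window_minutes):
    """Return True if `ts` falls inside a daily window that may cross midnight."""
    start = datetime.combine(ts.date(), window_start, tzinfo=ts.tzinfo)
    if ts < start:
        # The window that could contain `ts` opened on the previous day.
        start -= timedelta(days=1)
    return ts < start + timedelta(minutes=window_minutes)


def spikes_by_window(samples, threshold, window_start, window_minutes):
    """Split samples above `threshold` into (inside_window, outside_window)."""
    inside, outside = [], []
    for ts, latency in samples:
        if latency > threshold:
            if in_daily_window(ts, window_start, window_minutes):
                inside.append((ts, latency))
            else:
                outside.append((ts, latency))
    return inside, outside


# Made-up samples: (timestamp, write latency in seconds).
samples = [
    (datetime(2011, 4, 8, 0, 10), 0.25),   # spike, inside a 23:45 + 30 min window
    (datetime(2011, 4, 8, 12, 0), 0.30),   # spike, outside the window
    (datetime(2011, 4, 8, 23, 50), 0.01),  # inside the window, but not a spike
]
inside, outside = spikes_by_window(samples, 0.1, time(23, 45), 30)
```

If nearly all spikes land in `inside`, the backup window is the likely trigger; spikes in `outside` point to some other cause.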

Is there anyone on Stack who can explain how Amazon RDS backup actually works? I've read the Amazon RDS docs, and as far as I can tell, Amazon RDS is not behaving according to spec. Specifically, these backup/snapshot operations should be hitting my replica and therefore should not cause any downtime or performance hit, or so I thought.

I can distill my problem into six questions:

  • What is technically happening during a snapshot and a backup, and how are they different? (If you answer this question, please tell me whether you can empirically confirm your answer or are simply quoting documentation.)
  • Is a spike in write latency to be expected during the backup window on a multi-AZ deployment?
  • Is a spike in write latency to be expected at the end of a snapshot on a multi-AZ deployment?
  • Would my write latency spike be even higher if I were not multi-AZ?
  • Architecturally, would I be able to avoid these write latency spikes if I rolled my own database running on two m1.large EC2 instances?
  • Are there any configurations I can use that would avoid these write latency spikes while still hosting my DB with RDS, or am I effectively at the mercy of Amazon?

Bonus question: where and how do you host your MySQL database?

I can say that I have been generally happy with RDS except for these daily write latency issues. I love the built-in database monitoring, and it was fairly simple to set up and get going.

Thanks!


esilver

2 Answers


We also run several RDS instances, in addition to MySQL on some machines that we manage ourselves. I can't comment specifically, as I'm not an Amazon engineer, but here are several things I've learned that might explain what you're seeing:

  • Although Amazon does not share the backend details 100%, we strongly suspect that they are using their EBS system to back RDS databases.

  • This article helps explain EBS limitations and snapshot functionality: http://blog.rightscale.com/2008/08/20/amazon-ebs-explained/ Again, while it's not explicit, it would make sense for Amazon to be using this infrastructure to provide RDS services.

  • Typically, a MySQL backup, in contrast to a snapshot, involves using a tool like mysqldump to create a file of SQL statements that will then reproduce the database. The database does not need to be frozen to do this. With an EBS backend, the best practice is to freeze the database (pause all transactions) while you are snapshotting to avoid data corruption.

  • As for the spikes you're seeing at the end of the backup window: if Amazon pauses replication while it snapshots your replica, the replica then needs to "catch up" on transactions once the snapshot is complete. This would cause a latency spike.

  • Replication across a multi-AZ deployment is inherently slower than in a single-AZ deployment. That is the price you pay for better redundancy.
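The catch-up effect suggested in the list above can be illustrated with a toy model. This is a minimal sketch, not a description of how RDS works internally: the function names are hypothetical, and the write rate, apply rate, and pause length are made-up numbers chosen only to show how a backlog builds while replication is paused and then drains afterwards.

```python
def replication_backlog(write_rate, apply_rate, pause_seconds, total_seconds):
    """Return the replica backlog (unapplied writes) at the end of each second.

    Toy model: the master accepts `write_rate` writes per second throughout.
    The replica applies nothing for the first `pause_seconds`, then applies
    up to `apply_rate` writes per second.
    """
    backlog = 0
    history = []
    for second in range(total_seconds):
        backlog += write_rate
        if second >= pause_seconds:
            backlog -= min(backlog, apply_rate)
        history.append(backlog)
    return history


def catch_up_seconds(history, pause_seconds):
    """Seconds after the pause ends until the backlog first hits zero, or None."""
    for second, backlog in enumerate(history):
        if second >= pause_seconds and backlog == 0:
            return second - pause_seconds + 1
    return None


history = replication_backlog(
    write_rate=100, apply_rate=150, pause_seconds=60, total_seconds=300
)
print(max(history), catch_up_seconds(history, 60))  # 6000 120
```

In this model a 60-second pause takes twice as long to drain as it took to build up, because the replica has only 50 writes per second of spare capacity. A replica with less headroom would stay behind for longer.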

Joshua
  • I can confirm that Amazon RDS is using EBS as the backing store for its RDS databases. The Read Latency and Write Latency graphs in RDS Cloudwatch are effectively describing an EBS instance. Thank you for this answer, it makes sense. – esilver Apr 08 '11 at 18:52
  • Amazon shares more details in their outage post mortem here http://aws.amazon.com/message/65648/ – Joshua Apr 29 '11 at 19:32
  • @Joshua do you have any thoughts about this (somewhat related) topic? http://stackoverflow.com/questions/6799371/explain-this-memory-consumption-pattern-in-amazon-rds-mysql Thanks! – esilver Jul 24 '11 at 02:28
  • if using a read replica, would that affect the master replica? – Matej Mar 30 '14 at 14:17
  • Because after replication is unpaused, the master has accumulated a bunch of records that have not yet been replicated and have to be sent over all at once – Joshua Apr 26 '14 at 00:22
  • AWS documentation now states that "A brief I/O freeze, typically lasting a few seconds, occurs during both automated backups and DB snapshot operations on Single-AZ DB instances." http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.BackingUpAndRestoringAmazonRDSInstances.html – baxang Nov 15 '15 at 06:57
  • @Joshua the blog link is broken – mjaggard Mar 16 '22 at 09:21

Amazon has revealed the basic architecture it uses in Multi-AZ deployments. This may help people make decisions:

https://aws.amazon.com/blogs/database/amazon-rds-under-the-hood-multi-az/

Anurag Kale