I have several micro-services running in AWS, some of which communicate with each other, some of them having external clients or being clients to external services.
To implement my services I need a number of secrets (RSA key pairs to sign/verify tokens, symmetric keys, API keys etc). I am using AWS SecretsManager for this, and it works fine, but I'm now in the process of implementing proper support for key rotation and I have a few questions.
- I am using AWS SecretsManager, fetching secrets periodically (~ 5 minutes) and caching them locally.
- I am using the version stages feature of AWS SecretsManager to reference both AWSCURRENT and AWSPREVIOUS versions, as needed.
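A minimal sketch of the caching described above, assuming a hypothetical `SecretCache` class and an injected `fetch(stage)` callable that returns the secret value for a staging label, or `None` if no version carries that label. With boto3, `fetch` would presumably wrap `get_secret_value(SecretId=..., VersionStage=stage)` and translate a missing stage into `None`; that wiring is only hinted at in a comment.

```python
import time
from typing import Callable, Dict, List, Optional

STAGES = ("AWSCURRENT", "AWSPREVIOUS")


class SecretCache:
    """Caches the staged versions of one secret and refreshes them after a TTL."""

    def __init__(self, fetch: Callable[[str], Optional[str]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical: stage label -> value or None
        self._ttl = ttl_seconds      # ~5 minutes, as in the post
        self._clock = clock
        self._values: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def refresh(self) -> None:
        """Re-read every stage and replace the cached window in one go."""
        values = {}
        for stage in STAGES:
            value = self._fetch(stage)
            if value is not None:
                values[stage] = value
        self._values = values
        self._fetched_at = self._clock()

    def _ensure_fresh(self) -> None:
        expired = (self._fetched_at is None
                   or self._clock() - self._fetched_at >= self._ttl)
        if expired:
            self.refresh()

    def current(self) -> str:
        """The value a client should use for outgoing calls."""
        self._ensure_fresh()
        return self._values["AWSCURRENT"]

    def accepted(self) -> List[str]:
        """All values a verifying service should accept, newest first."""
        self._ensure_fresh()
        return [self._values[s] for s in STAGES if s in self._values]


# Assumed boto3 wiring (not executed here):
#   sm = boto3.client("secretsmanager")
#   fetch = lambda stage: sm.get_secret_value(
#       SecretId="my/secret", VersionStage=stage)["SecretString"]
```

Injecting the clock and the fetch function keeps the cache testable without AWS access.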
Let's say service A needs a key K for service B:
- Let's say at start, K has the current value K1 and the previous value K0.
- Service A will always use (and cache locally) the AWSCURRENT version of K in communication towards B, so in this case K1
- Service B will keep versions AWSCURRENT and AWSPREVIOUS in its local cache and accept both [K1, K0]
- When rotating K, I first make sure the secret used by service B is rotated, so that after the refresh interval has elapsed, all instances of service B accept [K2, K1] instead of [K1, K0]. Until the refresh interval has elapsed, all instances of A still use K1.
- When the refresh interval has elapsed, meaning all instances of B must have fetched K2, I rotate the key for service A, so that instances of A will use either K1 or K2 until the refresh interval has elapsed again, and then only K2.
- This completes the key rotation (but if K1 is believed to be compromised, we can rotate B's secret again to push out K1 and get [K3, K2]).
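The ordering argument above can be checked with a small simulation. This is only a sketch: it assumes K is used as an HMAC key (the post does not say what K is used for), and the `phases` table is an illustrative model of which key an instance of A might use and which window an instance of B might hold at each point.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(accepted_keys, message: bytes, tag: bytes) -> bool:
    """B accepts the message if any key in its window produces the same tag."""
    return any(hmac.compare_digest(sign(k, message), tag) for k in accepted_keys)


K0, K1, K2 = b"key-0", b"key-1", b"key-2"
msg = b"hello B"

# (phase name, keys some instance of A may use, windows some instance of B may hold)
phases = [
    ("start", [K1], [[K1, K0]]),
    ("B's secret rotated, B refreshing", [K1], [[K1, K0], [K2, K1]]),
    ("A's secret rotated, A refreshing", [K1, K2], [[K2, K1]]),
    ("done", [K2], [[K2, K1]]),
]


def phase_is_safe(a_keys, b_windows) -> bool:
    """True if every possible A instance is accepted by every possible B instance."""
    return all(verify(w, msg, sign(k, msg)) for k in a_keys for w in b_windows)


results = {name: phase_is_safe(a, b) for name, a, b in phases}

# Counter-example: rotating A's secret first lets A use K2 while B still holds [K1, K0].
wrong_order_safe = phase_is_safe([K2], [[K1, K0]])
```

Every entry in `results` comes out safe, while `wrong_order_safe` does not, which is why the verifier's secret has to be rotated first.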
Is this the best approach or are there others to consider?
Then, in some situations I have a symmetric key J that is used within the same service, for example a key to encrypt some session with. So in one request to service C, a session is encrypted with key J1, then needs to be decrypted with J1 at a later stage. I have multiple instances of the C service.
The problem here is that if the same secret is used for both encryption and decryption, rotating it becomes messier: if the key is rotated to the value J2 and one instance has refreshed so that it encrypts with J2, while another instance has not yet seen J2, decryption on that second instance will fail.
I can see a few approaches here:
Split into two secrets with separate rotation schemes and rotate one at a time, similar to the above. This adds overhead in terms of extra secrets to handle, with identical values (apart from them being rotated with some time in between)
Let the decryption force a refresh of the secret upon failure:
- Encryption always uses AWSCURRENT (J1 or J2 depending on if refreshed)
- Decryption will try AWSCURRENT then AWSPREVIOUS, and if both fail (because another instance encrypted with J2 while [J1, J0] is cached locally), it will force a refresh of the secret (so that [J2, J1] is now cached) and then try AWSCURRENT and AWSPREVIOUS again.
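A sketch of this refresh-on-failure approach. The names `KeyWindow`, `seal` and `try_open` are hypothetical, and `seal` is only a stand-in for real authenticated encryption: it appends an HMAC tag so that a wrong key is detectable, but it provides no confidentiality.

```python
import hashlib
import hmac
from typing import Callable, List, Optional

TAG_LEN = 32  # SHA-256 digest size


def seal(key: bytes, payload: bytes) -> bytes:
    """Stand-in for authenticated encryption: payload followed by an HMAC tag."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def try_open(key: bytes, token: bytes) -> Optional[bytes]:
    """Return the payload if the tag matches this key, otherwise None."""
    payload, tag = token[:-TAG_LEN], token[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(expected, tag) else None


class KeyWindow:
    """Holds [AWSCURRENT, AWSPREVIOUS] and reloads them when decryption fails."""

    def __init__(self, load: Callable[[], List[bytes]]) -> None:
        self._load = load            # hypothetical loader, newest key first
        self.keys = load()
        self.forced_refreshes = 0

    def refresh(self) -> None:
        self.keys = self._load()

    def encrypt(self, payload: bytes) -> bytes:
        return seal(self.keys[0], payload)   # always the current key

    def decrypt(self, token: bytes) -> bytes:
        for attempt in range(2):
            for key in self.keys:            # current first, then previous
                payload = try_open(key, token)
                if payload is not None:
                    return payload
            if attempt == 0:                 # first miss: refresh once, then retry
                self.forced_refreshes += 1
                self.refresh()
        raise ValueError("token does not match any known key version")
```

One thing to keep in mind with this design is that any invalid token also triggers a forced refresh, so in practice the refresh would probably need to be rate-limited.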
Use three keys in the key window and always encrypt with the middle one, since it should always be in the window of all other instances (unless it was rotated several times, faster than the refresh interval). This adds complexity.
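The three-key idea can be illustrated with a toy model. It assumes each instance holds a window ordered newest to oldest; since only AWSCURRENT and AWSPREVIOUS are maintained automatically, the third slot would presumably need a custom staging label. The helper names are hypothetical.

```python
def encryption_key(window):
    """Window is ordered newest to oldest, e.g. [J2, J1, J0]; use the middle key."""
    return window[1]


def can_decrypt(window, key) -> bool:
    return key in window


stale = ["J2", "J1", "J0"]      # instance that has not refreshed yet
fresh = ["J3", "J2", "J1"]      # instance that already sees the new version

# One rotation apart: each side's encryption key is inside the other's window.
one_step_ok = (can_decrypt(stale, encryption_key(fresh))
               and can_decrypt(fresh, encryption_key(stale)))

# Two rotations within one refresh interval: the stale instance falls behind.
two_ahead = ["J4", "J3", "J2"]
two_step_ok = can_decrypt(stale, encryption_key(two_ahead))
```

`one_step_ok` holds and `two_step_ok` does not, matching the caveat about rotating faster than the refresh interval.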
What other options are there? This seems like such a standard use-case but I still struggled to find the best approach.
EDIT ------------------
Based on JoeB's answer, the algorithm I've come up with so far is this: Let's say that initially the secret has the CURRENT value K1, and PENDING value null.
Normal operation
- All services periodically (every T seconds) query SecretsManager for AWSCURRENT, AWSPENDING and the custom label ROTATING and accept them all (if they exist) -> All services accept [AWSCURRENT=K1]
- All clients use AWSCURRENT=K1
Key rotation
- Put a new value K2 for the PENDING stage
- wait T seconds -> All services now accept [AWSCURRENT=K1, AWSPENDING=K2]
- Add ROTATING to the K1 version + move AWSCURRENT to the K2 version + remove the AWSPENDING label from K2 (there seems to be no atomic swapping of labels). Until T seconds have passed, some clients will use K2 and some K1, but all services accept both
- wait T seconds -> All services still accept [AWSCURRENT=K2, ROTATING=K1] and all clients use AWSCURRENT=K2
- Remove the ROTATING stage from K1. Note that K1 will still have the AWSPREVIOUS stage.
- After T seconds, all services will only accept [AWSCURRENT=K2], and K1 is effectively dead.
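The rotation steps above could be sketched roughly as follows. The function takes any client exposing `describe_secret`, `put_secret_value` and `update_secret_version_stage` with the same keyword arguments as the boto3 Secrets Manager client; `rotate_with_grace`, `FakeSecretsManager` and `accepted_values` are hypothetical names, and the fake is only an in-memory test double that approximates how labels move, not a faithful model of the service.

```python
import time
from typing import Callable

ACCEPTED_LABELS = {"AWSCURRENT", "AWSPENDING", "ROTATING"}


def rotate_with_grace(client, secret_id: str, new_value: str, refresh_seconds: float,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    stages = client.describe_secret(SecretId=secret_id)["VersionIdsToStages"]
    old_id = next(vid for vid, labels in stages.items() if "AWSCURRENT" in labels)

    # Step 1: publish the new value as AWSPENDING, then wait for services to see it.
    new_id = client.put_secret_value(SecretId=secret_id, SecretString=new_value,
                                     VersionStages=["AWSPENDING"])["VersionId"]
    sleep(refresh_seconds)

    # Step 2: label the old version first so it stays accepted during the swap.
    client.update_secret_version_stage(SecretId=secret_id, VersionStage="ROTATING",
                                       MoveToVersionId=old_id)
    client.update_secret_version_stage(SecretId=secret_id, VersionStage="AWSCURRENT",
                                       RemoveFromVersionId=old_id, MoveToVersionId=new_id)
    client.update_secret_version_stage(SecretId=secret_id, VersionStage="AWSPENDING",
                                       RemoveFromVersionId=new_id)
    sleep(refresh_seconds)

    # Step 3: retire the old version; it keeps only AWSPREVIOUS.
    client.update_secret_version_stage(SecretId=secret_id, VersionStage="ROTATING",
                                       RemoveFromVersionId=old_id)
    return new_id


class FakeSecretsManager:
    """In-memory test double (assumption: approximates the real label semantics)."""

    def __init__(self, initial_value: str) -> None:
        self.values = {"v1": initial_value}
        self.stages = {"v1": ["AWSCURRENT"]}

    def describe_secret(self, SecretId):
        return {"VersionIdsToStages": {k: list(v) for k, v in self.stages.items()}}

    def put_secret_value(self, SecretId, SecretString, VersionStages):
        vid = f"v{len(self.values) + 1}"
        self.values[vid] = SecretString
        self.stages[vid] = list(VersionStages)
        return {"VersionId": vid}

    def update_secret_version_stage(self, SecretId, VersionStage,
                                    MoveToVersionId=None, RemoveFromVersionId=None):
        if RemoveFromVersionId is not None:
            self.stages[RemoveFromVersionId].remove(VersionStage)
            if VersionStage == "AWSCURRENT":
                # Moving AWSCURRENT leaves AWSPREVIOUS on the version it came from.
                for labels in self.stages.values():
                    if "AWSPREVIOUS" in labels:
                        labels.remove("AWSPREVIOUS")
                self.stages[RemoveFromVersionId].append("AWSPREVIOUS")
        if MoveToVersionId is not None:
            self.stages[MoveToVersionId].append(VersionStage)
        return {}


def accepted_values(fake: FakeSecretsManager) -> set:
    """Values a service would accept when it reads all three labels."""
    return {fake.values[vid] for vid, labels in fake.stages.items()
            if ACCEPTED_LABELS & set(labels)}
```

Because ROTATING is attached before AWSCURRENT is moved, and AWSPENDING is removed last, both values carry at least one accepted label at every intermediate point, even though the three label updates are not atomic.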
This should work both for separate secrets and for symmetric secrets used for both encryption and decryption.
Unfortunately I don't know how to use the built-in rotation mechanism for this since it requires several steps with delays in between. One idea is to invent some custom steps and have the setSecret step create a CloudWatch cron event that will invoke the function again after T seconds, calling it with steps swapPending and removePending. It would be awesome if SecretsManager could support this automatically, for example by supporting that the function returns a value indicating that the next step should be invoked after T seconds.