I am running a Flink application on the AWS Kinesis Data Analytics (KDA) service. The last checkpoint size of my KDA Flink application appears to be growing steadily over time. The sudden drops in checkpoint size visible in the attached graph correspond to the times I deployed changes to the app, which causes it to take a snapshot, update, and then restore from that snapshot. My concern is that once the application is no longer under active development, changes will be deployed less often, and the checkpoint size could eventually grow too large.
Does anyone know what would cause the checkpoint size to grow continuously without bound? I am using State TTL on all significant state, and I remove state in application code when it is no longer needed. Does the increasing checkpoint size indicate a bug in my state-handling code, or is something else potentially at play here?