I'm having issues to monitor the memory usage in my aws ECS (Fargate) cluster. I have seen in the cloudwatch dashboard that the memory seemed not correctly garbage collected and so the autoscaling policy continues to add new instances.
So I added these logs every time I press a button in my web application (the one deployed in aws ECS):
public static void memoryCheck() {
final long maxHeapSize = Runtime.getRuntime().maxMemory();
final long currentHeapSize = Runtime.getRuntime().totalMemory();
final long freeMemory = Runtime.getRuntime().freeMemory();
final long usedMemory = currentHeapSize - freeMemory;
final float usagePercentage = 100f * ((float)usedMemory / (float)maxHeapSize);
log.info("max heap size: {}", maxHeapSize);
log.info("current heap size: {}", currentHeapSize);
log.info("free memory: {}", freeMemory);
log.info("used memory: {}", usedMemory);
log.info("usage percentage: {}%", usagePercentage);
}
I ran some computations and in the meantime I used this query in cloudwatch:
fields MemoryUtilized, TaskId
| stat avg(100* (MemoryUtilized/MemoryReserved)) as usedMemoryPercentage by bin(1m)
| limit 200
And I compared these results with what I see in my application logs:
[2022-07-31 15:05:45,409] usage percentage: 4.680836%
[2022-07-31 15:07:15,255] usage percentage: 28.25417%
[2022-07-31 15:10:35,612] usage percentage: 9.401683%
[2022-07-31 15:10:37,186] usage percentage: 12.7543%
[2022-07-31 15:11:45,724] usage percentage: 13.398229%
[2022-07-31 15:11:46,983] usage percentage: 17.017136%
[2022-07-31 15:12:02,926] usage percentage: 17.581753%
[2022-07-31 15:12:04,195] usage percentage: 20.910671%
[2022-07-31 15:12:05,142] usage percentage: 21.470863%
[2022-07-31 15:12:06,411] usage percentage: 25.008553%
[2022-07-31 15:12:07,426] usage percentage: 25.564808%
[2022-07-31 15:12:08,733] usage percentage: 28.876656%
[2022-07-31 15:12:09,682] usage percentage: 29.434296%
[2022-07-31 15:12:11,009] usage percentage: 32.670258%
[2022-07-31 15:12:11,884] usage percentage: 33.084824%
[2022-07-31 15:12:13,361] usage percentage: 5.8098726%
[2022-07-31 15:12:14,170] usage percentage: 6.2343636%
[2022-07-31 15:12:15,369] usage percentage: 9.761842%
[2022-07-31 15:12:16,261] usage percentage: 10.529714%
[2022-07-31 15:12:17,516] usage percentage: 13.84009%
[2022-07-31 15:12:18,319] usage percentage: 14.395365%
[2022-07-31 15:12:19,648] usage percentage: 17.555946%
[2022-07-31 15:12:38,523] usage percentage: 18.108969%
[2022-07-31 15:12:39,829] usage percentage: 21.39381%
[2022-07-31 15:15:00,594] usage percentage: 22.144833%
[2022-07-31 15:15:02,086] usage percentage: 25.362867%
[2022-07-31 15:20:40,696] usage percentage: 26.10088%
[2022-07-31 15:20:41,987] usage percentage: 29.37069%
[2022-07-31 15:20:53,282] usage percentage: 29.936934%
[2022-07-31 15:20:54,626] usage percentage: 5.3704033%
[2022-07-31 15:20:55,472] usage percentage: 5.906469%
[2022-07-31 15:20:56,626] usage percentage: 9.137836%
[2022-07-31 15:20:57,673] usage percentage: 9.693291%
[2022-07-31 15:20:58,844] usage percentage: 13.019199%
[2022-07-31 15:20:59,950] usage percentage: 13.584228%
[2022-07-31 15:21:01,196] usage percentage: 16.747267%
[2022-07-31 15:21:02,111] usage percentage: 17.29407%
[2022-07-31 15:21:03,343] usage percentage: 20.61859%
[2022-07-31 15:21:04,242] usage percentage: 21.172577%
[2022-07-31 15:21:05,502] usage percentage: 24.336498%
[2022-07-31 15:27:46,785] usage percentage: 25.272879%
[2022-07-31 15:27:48,120] usage percentage: 28.3611%
[2022-07-31 15:27:50,839] usage percentage: 28.914537%
[2022-07-31 15:27:52,085] usage percentage: 32.04773%
[2022-07-31 15:27:54,059] usage percentage: 4.9811816%
[2022-07-31 15:27:55,317] usage percentage: 8.339294%
As you can see in the application logs the situation seems ok, instead in cloudwatch it seems the memory is not garbage collected correctly and in general there are mismatches (example: look at 15:12:13)
What do you think can be the issue? Am I missing something?
P.S. The "experiment" has been done in ECS with 1 service and 1 task with autoscaling disabled And the java process is initialized with these params: "-XX:+UseParallelOldGC", "-Xms11G", "-Xmx11G"