
I am running a memory-intensive service on AWS (m4.15xlarge and m5.12xlarge instances) and have noticed that under certain conditions (usually after passing a certain memory limit) the time it takes to allocate additional memory increases dramatically (up to 50x).

When this happens, htop shows 100% CPU in kernel mode (red) and perf top looks like this:

  62.82%  [kernel]                                           [k] pageblock_pfn_to_page                                                              
  15.21%  [kernel]                                           [k] clear_page_c_e                                                                     
   5.78%  [kernel]                                           [k] get_pfnblock_flags_mask                                                            
   3.06%  [kernel]                                           [k] compaction_alloc                                                                   
   1.59%  [kernel]                                           [k] clear_huge_page                                                                    
   1.49%  [kernel]                                           [k] _raw_spin_lock                                                                     
   1.41%  [kernel]                                           [k] async_page_fault                                                                   
   1.29%  a.out                                              [.] memTest                                                                            
   0.99%  [kernel]                                           [k] get_page_from_freelist                                                             
   0.85%  [kernel]                                           [k] compact_zone                                                                       
   0.69%  [kernel]                                           [k] try_charge                                                                         
   0.51%  [kernel]                                           [k] error_entry                                                                        
...

During normal operation it looks like this:

  66.29%  [kernel]                                           [k] clear_page_c_e
   7.05%  [kernel]                                           [k] clear_huge_page
   3.91%  a.out                                              [.] memTest
   3.66%  [kernel]                                           [k] _raw_spin_lock
   3.12%  [kernel]                                           [k] async_page_fault
   2.68%  [kernel]                                           [k] get_page_from_freelist
   1.93%  [kernel]                                           [k] _cond_resched
   1.49%  [kernel]                                           [k] try_charge
   1.12%  [kernel]                                           [k] error_entry
   1.01%  [kernel]                                           [k] retint_user
   0.93%  [kernel]                                           [k] handle_mm_fault
   0.77%  [kernel]                                           [k] mem_cgroup_try_charge
   0.67%  [kernel]                                           [k] pmd_pfn
   0.66%  [kernel]                                           [k] __rmqueue.isra.80
...

I do not entirely understand what triggers this behavior. Sometimes it is rather hard to reproduce, and sometimes it happens all the time.

My hypothesis was that this is related to AWS virtualization (so I attributed the intermittent nature of the issue to whatever is happening on the "neighbors"), but see the Update below. I was also unable to reproduce the problem on an m5.metal machine.

I was able to reproduce the problem using a simple C program that allocates and initializes memory in a loop:

void memTest(long chunk, long total) {
    struct timeval prev, cur = {0,0}, lastProgress = {0,0};
    int i, j;
    int num = total / chunk;
    int p, progress = -1;
    uint8_t *data[num];

    get_now(&prev);

    for (i = 0; i < num; i++) {
        /* allocate a chunk and touch one byte per 4 KiB page to fault it in */
        data[i] = malloc(chunk);
        for (j = 0; j < chunk; j += 4096) {
            data[i][j] = rand() % 256;
        }

        /* record how long this chunk took */
        get_now(&cur);
        add(delta(&prev, &cur));
        prev = cur;

        /* report progress in 5% steps */
        p = (i * 20) / num * 5;

        if (p != progress) {
            if (lastProgress.tv_sec == 0) {
                printf("%s: %02d%%\n", format(&cur), p);
            } else {
                double elapsed = delta(&lastProgress, &cur);
                printf("%s: %02d%% (%gms)\n", format(&cur), p, elapsed);
            }
            lastProgress = cur;
            progress = p;
        }
    }
}
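
The helpers get_now, delta (milliseconds between two timestamps), add (accumulates per-chunk timings) and format (timestamp formatting), together with the main that parses --total and --chunk, are not shown above; a minimal sketch of what they might look like (my assumption, not the original code):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

static void get_now(struct timeval *tv) { gettimeofday(tv, NULL); }

/* elapsed time between two timestamps, in milliseconds */
static double delta(struct timeval *from, struct timeval *to) {
    return (to->tv_sec - from->tv_sec) * 1000.0 +
           (to->tv_usec - from->tv_usec) / 1000.0;
}

static double total_ms = 0;                 /* running sum of per-chunk timings */
static void add(double ms) { total_ms += ms; }

/* "YYYY-MM-DD HH:MM:SS.uuuuuu", as in the output below */
static const char *format(struct timeval *tv) {
    static char buf[64];
    char sec[32];
    struct tm tm;
    localtime_r(&tv->tv_sec, &tm);
    strftime(sec, sizeof(sec), "%Y-%m-%d %H:%M:%S", &tm);
    snprintf(buf, sizeof(buf), "%s.%06ld", sec, (long)tv->tv_usec);
    return buf;
}

/* ... memTest() from above goes here ... */

int main(int argc, char **argv) {
    long total = 0, chunk = 0;
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "--total") == 0) total = atol(argv[i + 1]);
        if (strcmp(argv[i], "--chunk") == 0) chunk = atol(argv[i + 1]);
    }
    if (total <= 0 || chunk <= 0) {
        fprintf(stderr, "usage: %s --total BYTES --chunk BYTES\n", argv[0]);
        return 1;
    }
    memTest(chunk, total);
    return 0;
}

Running it on m5.12xlarge close to the memory limit:
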
m5.12xlarge$ ./a.out --total 182714368000 --chunk 16777216
2019-03-27 05:03:22.805827: 00%
2019-03-27 05:03:25.035575: 05% (2229.75ms)
2019-03-27 05:03:27.244955: 10% (2209.38ms)
2019-03-27 05:03:29.458160: 15% (2213.2ms)
2019-03-27 05:03:31.665313: 20% (2207.15ms)
2019-03-27 05:03:33.871949: 25% (2206.64ms)
2019-03-27 05:03:36.075955: 30% (2204.01ms)
2019-03-27 05:03:38.284512: 35% (2208.56ms)
2019-03-27 05:03:40.489039: 40% (2204.53ms)
2019-03-27 05:03:42.697444: 45% (2208.41ms)
2019-03-27 05:03:44.902268: 50% (2204.82ms)
2019-03-27 05:03:47.110703: 55% (2208.43ms)
2019-03-27 05:03:49.315001: 60% (2204.3ms)
2019-03-27 05:03:51.523370: 65% (2208.37ms)
2019-03-27 05:03:53.728535: 70% (2205.16ms)
2019-03-27 05:03:55.936081: 75% (2207.55ms)
2019-03-27 05:03:58.141149: 80% (2205.07ms)
2019-03-27 05:04:00.349740: 85% (2208.59ms)
2019-03-27 05:04:02.553894: 90% (2204.15ms)
2019-03-27 05:04:04.762675: 95% (2208.78ms)
2019-03-27 05:04:41.470692: 100% (36708ms)   <---

This time I only triggered the problem close to the memory limit, but I have also seen it with as little as 20 GB allocated (out of the 186 GB available).

I would greatly appreciate it if someone could shed light on what is going on and how to get rid of this effect.

Update:

I have played with the issue a bit more and currently blame Transparent Huge Page (THP) support, which happens to be set to always on m5.12xlarge and to madvise on m5.metal. After toggling the setting on both machines, I was able to reproduce the issue on m5.metal, and the problem went away on m5.12xlarge.
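
For anyone checking this on their own instance: the mode in question is the value of /sys/kernel/mm/transparent_hugepage/enabled, where the active setting is shown in brackets (e.g. [always] madvise never). A trivial way to read it from a program, written in C only to stay close to the test program above (my own sketch, not part of the original code):

#include <stdio.h>

/* Print the system-wide THP mode; the active value appears in brackets,
   e.g. "[always] madvise never". Equivalent to reading the file by hand. */
static void print_thp_mode(void) {
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL) {
        perror("fopen");
        return;
    }
    if (fgets(line, sizeof(line), f) != NULL) {
        printf("THP mode: %s", line);
    }
    fclose(f);
}

int main(void) {
    print_thp_mode();
    return 0;
}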

Sergey Zimin
    Amazon EC2 instances are isolated from "neighbours". Memory is not oversubscribed, so "neighbours" will never impact CPU, Memory or disk access. – John Rotenstein Mar 27 '19 at 07:18
  • @JohnRotenstein, I think you are right and my hypothesis is wrong. I have played a bit more and it looks like the issue is caused by [Transparent Hugepage Support](https://www.kernel.org/doc/Documentation/vm/transhuge.txt). I have checked that this feature was enabled on `m5.12xlarge` and disabled on `m5.metal`. – Sergey Zimin Mar 27 '19 at 16:28

1 Answer


As suggested in the Update section, the issue seems to be caused by Transparent Huge Page (THP) support. It has already been found problematic by MongoDB and others, and disabling it is commonly recommended.

There are numerous questions on Stack Overflow and Stack Exchange about how to disable THP.
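
For reference, the system-wide switch is the same /sys/kernel/mm/transparent_hugepage/enabled file mentioned in the Update (values always, madvise, never; writable as root). If changing the global setting is not an option, a process can also opt out of THP for a specific region with madvise(MADV_NOHUGEPAGE); the sketch below is my own illustration, not taken from any linked discussion:

#define _GNU_SOURCE                            /* for MADV_NOHUGEPAGE and MAP_ANONYMOUS */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* Map an anonymous region and ask the kernel not to back it with
   transparent huge pages, regardless of the global THP mode. */
static void *alloc_no_thp(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return NULL;
    }
    if (madvise(p, size, MADV_NOHUGEPAGE) != 0) {
        perror("madvise(MADV_NOHUGEPAGE)");    /* non-fatal: the mapping still works */
    }
    return p;
}

int main(void) {
    size_t chunk = 16 * 1024 * 1024;           /* same 16 MiB chunk size as the test */
    uint8_t *p = alloc_no_thp(chunk);
    if (p == NULL) {
        return 1;
    }
    for (size_t j = 0; j < chunk; j += 4096) {
        p[j] = 1;                              /* fault pages in as regular 4 KiB pages */
    }
    munmap(p, chunk);
    return 0;
}

Whether the global setting or a per-region opt-out is the better fix depends on whether other workloads on the instance benefit from huge pages.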

Sergey Zimin