
Here is the code, compiled with `gcc test.c -std=c99`:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
        size_t size = (size_t) 40 * 1024 * 1024 * 1024;  /* 40G elements */
        int *buffer = malloc(size * sizeof(int));        /* 160 GB total */
        if (buffer == NULL) {
                perror("malloc");
                return 1;
        }
        for (size_t i = 0; i < size; i++) {
                buffer[i] = 1;
        }
        printf("hello\n");
        for (size_t i = 0; i < size; i++) {
                buffer[i] = 2;
        }
        printf("hello\n");
        return 0;
}

160 GB of RAM (40G ints × 4 bytes each) allocated in one go, and traversed twice.

The first loop runs happily.

However, the program gets stuck inside the second loop,

with `perf top` showing this:

Samples: 7M of event 'cpu-clock', Event count (approx.): 14127849698
 74.95%  [kernel]             [k] change_protection
 23.52%  [kernel]             [k] vm_normal_page
  0.40%  [kernel]             [k] _raw_spin_lock
  0.34%  [kernel]             [k] _raw_spin_unlock

and `top` showing this:

top - 10:52:36 up 55 min,  4 users,  load average: 1.16, 1.18, 1.04
Tasks: 240 total,   2 running, 238 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.1%sy,  0.0%ni, 96.9%id,  0.0%wa,  0.0%hi,  0.0%s
Mem:  251913536k total, 170229472k used, 81684064k free,    27820k b
Swap:        0k total,        0k used,        0k free,   352816k cac

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12825 dapeng    20   0  160g 160g  376 R 100.0 66.6  30:38.55 a.out

And the best part: the program no longer responds to the `kill` command.


gcc version: gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)

Server: AWS EC2 r3.8xlarge

Operating system: Amazon Linux AMI release 2014.09

This is a related question to Why does `change_protection` hog CPU while loading a large amount of data into RAM?

Dapeng
  • Can you do an experiment for us, please? Change your memory allocation line to read `size_t size = ((size_t) 1) * 1024 * 1024 * 1024;` and re-run the program. It should complete almost immediately. Now increase the 1 to a 2 and run it again. Again, it should complete almost immediately, 2GB is peanuts these days. Keep increasing the number by 1 and re-running the program until it gets stuck. Report the smallest number at which the program gets stuck. – zwol Oct 21 '15 at 13:37

1 Answer


So what happens is:

  1. You malloc memory. That means the kernel maps pages into your program's address space; for that, the pages don't even have to be backed by physical memory yet. It's more of a "notify me once you actually access that memory" situation.
  2. Your first loop accesses the pages sequentially. Thus, the kernel frequently has to assign real memory to your formerly unused pages. Presumably, the AWS hypervisor is fast at providing that memory from its pool.
  3. You access these pages again -- but as bad luck might have it, the pages at the start of your 160 GB area might be "far" away now, and getting them back close to the CPU your code is running on might be slow. Remember, you're running in a VM on something resembling a NUMA machine.

EDIT: the fact that it doesn't react to SIGTERM etc. isn't surprising -- there's no explicit context switch happening and no signal checking done in the for loops, so there's no point at which your program could react to signals.

Marcus Müller
  • it is taking way too much time, 30 minutes as shown by top, and it has no will to stop, __and__ it's not responding to my kill command either. Very curious what the kernel/hypervisor is doing now – Dapeng Oct 21 '15 at 11:27
  • 1
    I think usage of `madvise()` with `MADV_SEQUENTIAL` might be helpful (or not, I haven't tested this) – Hasturkun Oct 21 '15 at 11:35
  • 1
    You're right overall, but you're a little confused about signal handling. If this program weren't stuck *inside the kernel* it would get killed by `SIGTERM` just fine. A signal can be delivered at any point where the CPU is executing user-space instructions, as long as it isn't blocked. `ps -ef` will probably report this process is in `D` state, which means *uninterruptible* I/O wait inside the kernel -- that is the only situation where signals don't work. – zwol Oct 21 '15 at 13:33
  • @zwol: hm, you seem to be right, but I doubt execution is stuck in kernel land – Marcus Müller Oct 21 '15 at 13:43
  • @Dapeng: could you run `ps -ef` when your process hangs. – Marcus Müller Oct 21 '15 at 13:44
  • I see that the OP did post a `ps` line. It's unfortunate that `ps` uses `R` for both "executing instructions in user space" and "executing instructions in the kernel". In the kernel, what Marcus originally said about signals is true more or less: they can't be delivered just anywhere. – zwol Oct 21 '15 at 14:37
  • 1
    [Looking at the code](http://lxr.free-electrons.com/source/mm/mprotect.c#L213), I think it's probably spinning inside the loop on lines 227-233, and if you trace downward, that does a *ton* of work which may well involve hypervisor calls for every single page, and there's no check for needing to reschedule or anything. On a uniprocessor this would lock up the entire machine until it finished. – zwol Oct 21 '15 at 14:40
  • I'm starting to think the answer here is "file a bug report against the kernel". – zwol Oct 21 '15 at 14:41
  • @zwol: **please** post this as an answer here. – Marcus Müller Oct 21 '15 at 14:51
  • @MarcusMüller If we can't just leave things as is, which would be my first choice, I would prefer that you edit my observations into *your* answer. – zwol Oct 21 '15 at 17:51
  • @zwol: Ok, I see your point. I think your comment is clear enough here. Maybe OP should also get in touch with the amazon guys -- people renting humongous VMs and then not being able to sequentially access RAM does indeed sound like a support case. – Marcus Müller Oct 21 '15 at 17:53