20

On a Linux system I am trying to call a program at runtime with the system() call. The system call exits with an return code not equal zero.

Calling WEXITSTATUS on the error code gives "127".

According to the man page of system this code indicates that /bin/sh could not be called:

In case /bin/sh could not be executed, the exit status will be that of a command that does exit(127).

I checked: /bin/sh is a link to bash. bash is there. I can execute it from the shell.

Now, how can I find out why /bin/sh could not be called ? Any kernel history or something?

Edit:

After the very helpful tip (see below) i strace -f -p <PID> the process. This is what I get during the system call:

Process 16080 detached
[pid 11779] <... select resumed> )      = ? ERESTARTNOHAND (To be restarted)
[pid 11774] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0, NULL) = 16080
[pid 11779] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 11779] rt_sigaction(SIGCHLD, {0x2ae0ff898ae2, [CHLD], SA_RESTORER|SA_RESTART, 0x32dd2302d0},  <unfinished ...>
[pid 11774] rt_sigaction(SIGINT, {0x2ae1042070f0, [], SA_RESTORER|SA_SIGINFO, 0x32dd2302d0},  <unfinished ...>
[pid 11779] <... rt_sigaction resumed> {0x2ae0ff898ae2, [CHLD], SA_RESTORER|SA_RESTART, 0x32dd2302d0}, 8) = 0
[pid 11779] sendto(5, "a", 1, 0, NULL, 0 <unfinished ...>
[pid 11774] <... rt_sigaction resumed> NULL, 8) = 0
[pid 11779] <... sendto resumed> )      = 1
[pid 11779] rt_sigreturn(0x2 <unfinished ...>
[pid 11774] rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x32dd2302d0},  <unfinished ...>
[pid 11779] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 11779] select(16, [9 15], [], NULL, NULL <unfinished ...>
[pid 11774] <... rt_sigaction resumed> NULL, 8) = 0
[pid 11774] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 11774] write(1, "Problems calling nvcc jitter: ex"..., 49) = 49
[pid 11774] rt_sigaction(SIGINT, {0x1, [], SA_RESTORER, 0x32dd2302d0}, {0x2ae1042070f0, [], SA_RESTORER|SA_SIGINFO, 0x32dd2302d0}, 8) = 0
[pid 11774] rt_sigaction(SIGQUIT, {0x1, [], SA_RESTORER, 0x32dd2302d0}, {SIG_DFL, [], SA_RESTORER, 0x32dd2302d0}, 8) = 0
[pid 11774] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 11774] clone(Process 16081 attached (waiting for parent)
Process 16081 resumed (parent 11774 ready)
child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7fff0177ab68) = 16081
[pid 16081] rt_sigaction(SIGINT, {0x2ae1042070f0, [], SA_RESTORER|SA_SIGINFO, 0x32dd2302d0},  <unfinished ...>
[pid 11774] wait4(16081, Process 11774 suspended
 <unfinished ...>
[pid 16081] <... rt_sigaction resumed> NULL, 8) = 0
[pid 16081] rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x32dd2302d0}, NULL, 8) = 0
[pid 16081] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 16081] execve("/bin/sh", ["sh", "-c", 0xdda1d98], [/* 58 vars */]) = -1 EFAULT (Bad address)
[pid 16081] exit_group(127)             = ?
Process 11774 resumed

When it comes to the call to /bin/sh it says bad address. Why that ?

Edit:

Here the whole part that involves the failing system (here already the safe copy to a buffer is in place):

  std::ostringstream jit_command;

  jit_command << string(CUDA_DIR) << "/bin/nvcc -v --ptxas-options=-v ";
  jit_command << "-arch=" << string(GPUARCH);
  jit_command << " -m64 --compiler-options -fPIC,-shared -link ";
  jit_command << fname_src << " -I$LIB_PATH/include -o " << fname_dest;

  string gen = jit_command.str();
  cout << gen << endl;

  char* cmd = new(nothrow) char[gen.size()+1];
  if (!cmd) ___error_exit("no memory for jitter command");
  strcpy(cmd,gen.c_str());

  int ret;

  if (ret=system(cmd)) {

    cout << "Problems calling nvcc jitter: ";

    if (WIFEXITED(ret)) {
      printf("exited, status=%d\n", WEXITSTATUS(ret));
    } else if (WIFSIGNALED(ret)) {
      printf("killed by signal %d\n", WTERMSIG(ret));
    } else if (WIFSTOPPED(ret)) {
      printf("stopped by signal %d\n", WSTOPSIG(ret));
    } else if (WIFCONTINUED(ret)) {
      printf("continued\n");
    } else {
      printf("not recognized\n");
    }

    cout << "Checking shell.. ";
    if(system(NULL))
      cout << "ok!\n";
    else
      cout << "nope!\n";

    __error_exit("Nvcc error\n");

  }
  delete[] cmd;
  return true;

Output:

/usr/local/cuda/bin/nvcc -v --ptxas-options=-v -arch=sm_20 -m64 --compiler-options -fPIC,-shared -link bench_cudp_Oku2fm.cu -I$LIB_PATH/include -o bench_cudp_Oku2fm.o
Problems calling nvcc jitter: exited, status=127
Checking shell.. ok!

Edit (first version of the code):

string gen = jit_command.str();
cout << gen << endl;
int ret;
if (ret=system(gen.c_str())) {
  ....

The complexity of the string creation is not the problem here. As strace shows a "bad address" is the problem. Its a legal string. A "bad address" should not occur.

As far as i know the std::string::c_str() returns a const char * that might point to a scratch space of libc++ where a read only copy of the string might be kept.

Unfortunately the error is not really reproduceable. The call to system succeeds several times before it fails.

I don't want to be hasty but it smells like a bug in either in the kernel, libc or the hardware.

Edit:

I produced a more verbose strace output (strace -f -v -s 2048 -e trace=process -p $!) of the failing execve system call:

First a succeeding call:

[pid  2506] execve("/bin/sh", ["sh", "-c", "/usr/local/cuda/bin/nvcc -v --ptxas-options=-v -arch=sm_20 -m64 --compiler-options -fPIC,-shared -link /home/user/toolchain/kernels-empty/bench_cudp_U11PSy.cu -I$LIB_PATH/include -o /home/user/toolchain/kernels-empty/bench_cudp_U11PSy.o"], ["MODULE_VERSION_STACK=3.2.8", ... ]) = 0

Now the failing one:

[pid 17398] execve("/bin/sh", ["sh", "-c", 0x14595af0], <list of vars>) = -1 EFAULT (Bad address)

Here <list of vars> is identical. It seems its not the list of environment variables that cause the bad address. As Chris Dodd mentioned the 3rd argument to execve is the raw pointer 0x14595af0, which strace thinks (and the kernel agrees) is invalid. strace does not recognize it as a string (so it prints the hex value and not the string).

Edit:

I inserted print out of the pointer value cmd to see what's the value of this pointer in the parent process:

  string gen = jit_command.str();
  cout << gen << endl;
  char* cmd = new(nothrow) char[gen.size()+1];
  if (!cmd) __error_exit("no memory for jitter command");
  strcpy(cmd,gen.c_str());
  cout << "cmd = " << (void*)cmd << endl;
  int ret;
  if (ret=system(cmd)) {
    cout << "failed cmd = " << (void*)cmd << endl;
    cout << "Problems calling nvcc jitter: ";

Output (for the failing call):

cmd = 0x14595af0
failed cmd = 0x14595af0
Problems calling nvcc jitter: exited, status=127
Checking shell.. ok!

Its the same pointer value as the 3rd argument from strace. (I updated the strace output above).

Regards the 32bit looking of the cmd pointer: I checked the value of the cmd pointer for a succeeding call. Can't see any difference in structure. That's one of the values of cmd when then system call succeeds:

cmd = 0x145d4f20

So, before the system call the pointer is valid. As the strace output from above suggests the child process (after calling fork) receives the correct pointer value. But, for some reason, the pointer value is marked invalid in the child process.

Right now we think its either:

  • libc/kernel bug
  • hardware problem

Edit:

Meanwhile let me post a workaround. Its so silly to be forced to implement something like that... but it works. So the following code block gets executed in case the system call fails. It allocates new command strings and retries until it succeeds (well not indefinitely).

    list<char*> listPtr;
    int maxtry=1000;
    do{
      char* tmp = new(nothrow) char[gen.size()+1];
      if (!tmp) __error_exit("no memory for jitter command");
      strcpy(tmp,gen.c_str());
      listPtr.push_back( tmp );
    } while ((ret=system(listPtr.back())) && (--maxtry>0));

    while(listPtr.size()) {
      delete[] listPtr.back();
      listPtr.pop_back();
    }

Edit:

I just saw that this workaround in one particular run did not work. It went the whole way, 1000 attempts, all with newly allocated cmd command strings. All 1000 failed. Not only this. I tried on a different Linux host (same Linux/software configuration tho).

Taking this into account one would maybe exclude a hardware problem. (Must be on 2 physically different hosts then). Remains a kernel bug ??

Edit:

torek, i will try and install a modified system call. Give me some time for that.

ritter
  • 7,447
  • 7
  • 51
  • 84
  • 2
    Use `strace` on your program to find out what is happenning. – Basile Starynkevitch Mar 12 '12 at 19:41
  • Thanks Basile! Thats not so easy. I run it through mpirun. Still possible? – ritter Mar 12 '12 at 19:43
  • are you using `system("/bin/sh")` or `system("/bin/sh foo bar")`? – Anycorn Mar 12 '12 at 19:47
  • You can use `strace -f -p 1234` to understand the syscalls done by process of pid 1234 – Basile Starynkevitch Mar 12 '12 at 19:51
  • Using 'system("foo")'. This calls '/bin/sh -c foo' (according man page system) – ritter Mar 12 '12 at 19:52
  • Thanks again Basile! That's a great tool. At the moment i am running the mpi application and with an interactive session stracing the process. But I am not sure yet if this helps to find the reason.. – ritter Mar 12 '12 at 20:02
  • 1
    Can you try calling `system(NULL)` to verify you have shell? – Anycorn Mar 12 '12 at 20:02
  • 1
    Also: is `foo` in `$PATH`? Try invoking with absolute path. – Anycorn Mar 12 '12 at 20:04
  • Anycorn, I already give the absolute path. I can try inserting a conditional and check for system(NULL) – ritter Mar 12 '12 at 20:09
  • More information can be found using perror(). It will work as long as stdout isn't closed. If it is, then feel free to print strerror(errno) to whatever stream is available. – dbeer Mar 12 '12 at 20:12
  • Basile, something really weird. Guess what. When observing the process with strace it did tun fine. Then I rerun again and it seems to be fine also. How can that be?! Nothing else touched. – ritter Mar 12 '12 at 20:20
  • 2
    Show us the actual code for your call to `system()`. – Keith Thompson Mar 12 '12 at 20:23
  • Ok Keith. Here you go: system returns the `int` 32512. Its format is specified in wait(2). When calling `WEXITSTATUS` on this int it gives 127 – ritter Mar 12 '12 at 20:27
  • Here we are again! Same error. This time I was able to strace the process. I'll edit the question.. – ritter Mar 12 '12 at 20:40
  • 2
    The `EFAULT` means you're passing a bad pointer to `system()`. So as others have said, please show us exactly how you are calling it, and what happens to its parameter before that. – mpontillo Mar 12 '12 at 20:53
  • Thanks Mike, i'll edit the question again with the call. – ritter Mar 12 '12 at 20:56
  • +Anycorn, I checked for `system(NULL)`. The shell is available, i.e. return value not equal zero – ritter Mar 12 '12 at 23:48
  • What about the code around the call to `system`, perhaps you've run into a [lifetime problem with `gen.c_str`](http://stackoverflow.com/questions/6456359/what-is-stdstringc-str-lifetime). If you copy the string to a buffer and call system, does the `EFAULT` go away? – user7116 Mar 12 '12 at 23:50
  • @sixlettervariables, that's exactly what i just tried. I copied the ::c_str() to a c-string (yes, including '\0') and launched `system` on that buffer. Result: Same error 127, aka EFAULT. – ritter Mar 12 '12 at 23:54
  • Now it looks more like a kernel bug – ritter Mar 12 '12 at 23:57
  • @Frank: another interesting find is that a non-NULL terminated address list after a [`fork` can cause this behavior](http://www.wlug.org.nz/EFAULT). Which I think the nvcc is doing, perhaps there is a bug in `nvcc`? – user7116 Mar 13 '12 at 00:01
  • @Keith, have a look at the code above. Its now the full length version. – ritter Mar 13 '12 at 00:06
  • @sixlettervariables your buffer copy is in! look at the code above – ritter Mar 13 '12 at 00:06
  • 1
    Step through the program with [gdb](http://dirac.org/linux/gdb/01-Introduction.php). For an easier to use version, look into [pyclewn if you use vim](http://pyclewn.sourceforge.net/news.html), [gdb mode in emacs](http://www.gnu.org/software/emacs/manual/html_node/emacs/GDB-Graphical-Interface.html), [cgdb](http://www.cs.sandia.gov/cplant/doc/runtimeAdmin1_0/node64.html), or [ddd](http://www.gnu.org/software/ddd/). – Spencer Rathbun Mar 12 '12 at 19:47
  • @sixlettervariables doesn't the `strace` output show clearly that it never comes to a call to nvcc? maybe i don't get your point. – ritter Mar 13 '12 at 00:17
  • @Frank: apologies, that was before I saw your full code. I thought the clone/fork was inside of nvcc and the execve was nvcc's doing. – user7116 Mar 13 '12 at 00:26
  • I included a `system` call to `/usr/local/cuda/nvcc --help` in the conditional basic block if `system` fails. Just to see if the program is still reachable. This succeeds! Why on earth does the prior call fail? – ritter Mar 13 '12 at 00:30
  • 2
    Pretty baffling. The key line in the `strace` output: `[pid 16081] execve("/bin/sh", ["sh", "-c", 0xdda1d98], [/* 58 vars */]) = -1 EFAULT (Bad address)` shows that the `new` space is at 0xdda1d98 and omits 58 env variables; the EFAULT address must be either the one shown, or one of those 58 env variables. Perhaps you could check all the variables in `environ` for validity, just to make sure they're all OK at that point, – torek Mar 13 '12 at 00:43
  • Perhaps run strace with `-f -s 2048 -v -e trace=process` to get more of the arguments to `execvc` printed. – user7116 Mar 13 '12 at 00:50
  • @torek on the way.. it take a little the error doesn't show itself very often – ritter Mar 13 '12 at 01:41
  • @torek caught the `system` call again when failing. i'll update the question in a minute... – ritter Mar 13 '12 at 01:55
  • @torek there is one variable that is zero. '`LS_COLORS=`' is shown. It is also zero hundreds of other times when this call succeeds. You think is a problem? – ritter Mar 13 '12 at 02:30
  • If it's literally `LS_COLORS=0` that's fine, if it's address-0 that's a NULL pointer (on any Unix-y system these days) and not right. Looks more like the address you got back from `new` has become invalid in the child process (the result of the clone), which implies a kernel (or possibly hardware) problem, or just barely possible, a shared-VM thread issue. Since it occurs rarely, perhaps you could work around it by repeating the `system()` call if necessary, but if this works, it's just papering over the underlying problem. – torek Mar 13 '12 at 05:48
  • `cmd` appears not to be a valid address for some reason. That's annoying, but maybe it has something to do with the rest of your program. Do you use a garbage collector or something like that? Could you try to debug the piece of code after allocating `cmd`? –  Mar 13 '12 at 06:39
  • @torek I have already tried a `while` loop until the call succeeds. But once it fails all subsequent calls fail as well with the same error code. Its not literally `LS_COLORS=0`, but it reads `LS_COLORS=`. But it shouldn't be the problem since in the same run the same call (same program location) succeeds before with the same set of environs. – ritter Mar 13 '12 at 09:50
  • @artsenay Garbage collection is not in use. Its a single-threaded program, so no interference from potentially bad-behaving other program parts (would you agree?). `system` appears to be libc function, that splits `cmd` into an array of `c-strings`. The address of that array is passed (and must be NULL terminated). Maybe `system` didn't do a good job there and its not NULL terminated. How could one debug the `char*[]` that is passed as argument to execve? (The whole program is started through `mpirun -n8`) – ritter Mar 13 '12 at 10:52
  • Doesn't the WIFEXITED just mean the process exited normally (though with an error return code). According to [the google](http://tldp.org/LDP/abs/html/exitcodes.html), 127 means "command not found". I assume you already checked this, but didn't see in the question/comments: did you run that output command from the shell? Another thought - you could try calling the `exec*()` methods directly, for more internal details? Another question - is `$LIB_PATH` defined in your profile, or in that list of variables? – Rob I Mar 13 '12 at 12:49
  • @Rob `exec` is a front-end to `execve` which replaces current process image. That's not what i want. Yes, the command is in place. It is mentioned in the comments/question that the command succeeds several times (in the same run) before it fails. Environment is not the problem here – ritter Mar 13 '12 at 13:28
  • Does it act any differently when you skip over the gen and cmd stuff? if(ret = system(jit_command.str().c_str())) – Martin Mar 13 '12 at 13:59
  • @Martin That was my first version. Then I though it might be safer to copy the string to a new buffer each time in the hope the call doesn't fail. Negative. – ritter Mar 13 '12 at 14:14
  • Does it work if you run it without `mpirun`? – Karl Bielefeldt Mar 13 '12 at 15:17
  • @KarlBielefeldt Thats a good point. This problem seems to disappear when running without mpirun. However, the problem is not strictly reproduceable. So i can't tell for sure – ritter Mar 13 '12 at 15:25
  • Go to your kernel source tree in the file `include/linux/binfmts.h`, find the line with `#define MAX_ARG_PAGES 32` and change the value to something sane like 64 or 128. Recompile the kernel and try again :-) – Gunther Piez Mar 13 '12 at 17:15
  • `LS_COLORS=` (with nothing after it) is fine as far as env vars go (empty contents is OK, in fact the kernel does not even enforce the `=` sign). So it's down to just the one arg that strace prints in hex; and presumably it prints it in hex because it's not a valid address, or what's at that address is not a valid string, in the child of the `clone` call. Since you can print it as a valid string in the original process, something has gone wrong in that small span between `clone` and `execve`. – torek Mar 13 '12 at 17:51
  • @torek I don't think its an address of a string the number in hex. But an address of an array of c-strings, i.e. `char*[]`. See man page execve. To split the c-string into several c-strings is job of the `system` call. I agree something goes wrong there. How can one debug what's going on there? – ritter Mar 13 '12 at 17:56
  • See @Chris Dodd's note answer just added here: strace knows about the execve syscall and extracts and expands all three arguments. It looks to me as though `system` has done its job but then something goes wrong during or after the `clone`. I'll add one more answer where I can format code better... – torek Mar 13 '12 at 18:59
  • Keep track of the length of the string you're passing to `system`. It could be too long -- I remember hitting a limit there in the past. – JimR Mar 15 '12 at 15:58
  • @JimR, i think not the issue here. It runs say 10 to 30 times well until it fails. `system` called with the same length of string – ritter Mar 15 '12 at 16:24

2 Answers2

5

This is a weird one. strace understands that the arguments to execve are (pointers to) strings, so it prints out the pointed at strings UNLESS the pointer is invalid -- in which case it prints out the raw hex value of the pointer. So the strace line

[pid 16081] execve("/bin/sh", ["sh", "-c", 0xdda1d98], [/* 58 vars */]) = -1 EFAULT (Bad address)

makes perfect sense -- the 3rd argument to execve is the raw pointer 0xdda1d98, which strace thinks (and the kernel agrees) is invalid. So the question is, how is an invalid pointer getting here. This should be cmd, which just came back from new.

I would suggest putting the line

printf("cmd=%p\n", cmd);

just before the system call, to figure out what the C code thinks the pointer is.

Looking at the rest of the strace, it looks like you're running on a 64-bit system (from the pointers that are printed), and the invalid 0xdda1d98 looks like a 32 bit pointer, so it would seem to be some kind of 32/64 bit screwup (someone only saving and restoring 32 bits of a 64 bit register, or some such).

Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • You are absolutely right. The `system` call succeeds several times before it fails. When it succeeds one can see the `cmd` string as 3rd argument. (I'll update the question). When it fails we see the hex number. I took your advice and printed the pointer values. (See question) – ritter Mar 14 '12 at 09:12
  • Regards the 32bit looking of the `cmd` pointerL I checked the `cmd` pointer for a call that succeeds. Looks the same (difference value, but also 32bit value). See question – ritter Mar 14 '12 at 11:41
  • Smells like a simple buffer overrun somewhere. Have you run the program under valgrind ? – nos Mar 15 '12 at 12:17
2

Piggybacking off / extending @Chris Dodd's answer, consider that system itself looks (oversimplified on purpose) like this:

int system(char *cmd) {
    pid_t pid = fork();
    char *argv[4];
    extern char **environ;

    if (pid == 0) { /* child */
        argv[0] = "sh";
        argv[1] = "-c";
        argv[2] = cmd;
        argv[3] = NULL;
        execve("/bin/sh", argv, environ);
        _exit(127);
    }
    if (pid < 0) ... handle error ...
    ... use OS wait() calls to wait for result from child process ...
    return status; /* as provided by sh -c, or from _exit(127) above */
}

Given "64 bit system" and "register appears to be lopped off at 32 bits" it might be worth doing an objdump on the code and seeing whether argv[2] is set from a register whose upper bits might be somehow lost during the clone call (where I have fork above, glibc is using clone for efficiency).


Update: per the strace output, the clone call is not using CLONE_VM and CLONE_VFORK (not sure why not, these should make the call a lot more efficient) so the child is a "normal" child (a la old-Unix-style fork). A colleague suggested that perhaps the failing address is in a map that is set not to be copied into the child process. The contents of /proc/self/maps would be interesting after the failure; we could look at how the failing address is mapped. Comparing those maps with the ones in the child would be even more interesting. To get the ones in the child, though, you'll need to override the glibc version of system, and add something to read /proc/self/maps after the execve fails, before doing the _exit.
torek
  • 448,244
  • 59
  • 642
  • 775
  • This might be oversimplified here. `cmd` is not passed as just one argument. Take into account that `strace` understands it to be a `char*[]`. – ritter Mar 13 '12 at 22:18
  • `char **`, actually, pointing to the first element of an array—in this case, `&argv[0]`. `strace` then extracts argv[0], argv[1], ..., argv[i] until argv[i] is `NULL`, and then treats each extracted value as a `char *` pointing to the first element of a C string, and extracts each `char` until reaching the 0 byte. Since it's not expanding 0xdda1d98 in its output, that says that 0xdda1d98 is the invalid address causing the `EFAULT` error. – torek Mar 13 '12 at 22:23
  • Correct! Update: I had the value of pointer `cmd` to be printed. See question. Its the same value that enters as 3rd argument to `execve` and cannot be interpreted as the command string. – ritter Mar 14 '12 at 11:23
  • Well, this problem continues to be quite impressive. :-) The pointer is valid before the `clone` system call, the value is the same both before and after as well, but the address has become invalid. If it's not a kernel or hardware bug, the only remaining possibility is that the shared virtual address space is being altered by some other thread/process that has access to it. This seems exceedingly unlikely since that alteration would *also* affect the parent process. So ... kernel/hardware bug? – torek Mar 14 '12 at 19:38
  • I added an update with an interesting point about maps, based on an "interesting problems" conversation I had at the gym today. – torek Mar 15 '12 at 00:25