2

A bug has recently been "fixed" in a project I work on, but so far nobody has been able to explain why the fix works. (So is it really a fix?) The code is running in kernel space under a realtime system, so the problem causes a complete system lockup. This makes debugging harder than normal, too.

This version crashes the system:

int  dups[EMCMOT_MAX_AXIS] = {0};
char *coords = coordinates;
char coord_letter[] = {'X','Y','Z','A','B','C','U','V','W'};

This version does not crash

int  dups[EMCMOT_MAX_AXIS];
char *coords = coordinates;
char coord_letter[] = {'X','Y','Z','A','B','C','U','V','W'};
int  i;
for (i=0; i<EMCMOT_MAX_AXIS; i++) {dups[i] = 0;}

To really confuse matters, this experimental version also crashes

int  dups[EMCMOT_MAX_AXIS] = {0};
char *coords = coordinates;
char coord_letter[] = {'X','Y','Z','A','B','C','U','V','W'};
int  i;
for (i=0; i<EMCMOT_MAX_AXIS; i++) {dups[i] = 0;}

You can see the commit and the surrounding code here: https://github.com/LinuxCNC/linuxcnc/commit/ef6f36a16c7789af258d34adf4840d965f4c0b10

bodgesoc
  • 191
  • 9
  • 3
    I see no reason those changes should cause/fix the problem. I suspect the change is a red herring, and the real problem exists elsewhere. Perhaps a race condition or something that only happens occasionally exists, and is causing you to *think* this is the problem. – Jonathon Reinhart Mar 08 '20 at 16:40
  • 2
    Provide a [mre]. – Eric Postpischil Mar 08 '20 at 17:09
  • The problem is 100% repeatable, so isn't occasional. A race condition might be possible, but these modules are loaded in sequence by a script specifically because they depend upon each other. – bodgesoc Mar 08 '20 at 17:10
  • What compiler are you using? – Nate Eldredge Mar 08 '20 at 17:26
  • A minimum reproducible example would need to include a specific (RTAI) Linux kernel. I doubt that anyone outside of the project would want to go that far in looking for an answer. – bodgesoc Mar 08 '20 at 17:26
  • Compiler is gcc (Debian 8.3.0-6) 8.3.0 – bodgesoc Mar 08 '20 at 17:31
  • With what target? You should be able to inspect the generated assembly and confirm that the array gets properly initialized either way. – Nate Eldredge Mar 08 '20 at 17:31
  • But anyway, I'm inclined to think Jonathon is right: the problem is somewhere else, and it's mere coincidence that this change triggers it. The function seems pretty self-contained, and only called in two places, so if the crash is indeed within this function, you should be able to implement enough logging to figure out what inputs provoke it. – Nate Eldredge Mar 08 '20 at 17:50
  • I have tried adding "print" logging in several places in the function, but never get to see the output. – bodgesoc Mar 08 '20 at 18:42
  • I have tried adding "print" logging in several places in the function, but never get to see the output. (this is using tail -f /var/log/kern.log as that is where output from these kernel modules ends up). I don't know whether I would have any more luck logging to a serial port. It's quite a lot of effort to find out. – bodgesoc Mar 08 '20 at 18:51
  • 1
    Show us the assembly code of that function before and after the change. – 0andriy Mar 08 '20 at 20:37
  • [Here](https://godbolt.org/z/yqVmzJ) is the relevant file (first version) on the godbolt compiler explorer if anyone is interested. (I ran it through the preprocessor to make it self-contained, sorry it's so long.) I think you can see the initialization happening correctly at lines 28-30 of the assembly output. – Nate Eldredge Mar 08 '20 at 22:47
  • Thanks for that. I am afraid I haven't had cause to use assembly since the 1980s (on the Z80) so I can't immediately decipher it. But, given that it seems to be possible to initialise the array twice, in two different ways, and still get the crash, I am becoming even more perplexed. – bodgesoc Mar 08 '20 at 23:36
  • Further info. If gcc-6 is chosen then the problem appears to go away. The compiler explorer that Nate Eldridge set up seems to show some differences in output, too. – bodgesoc Mar 09 '20 at 08:58
  • @NateEldredge, I don't believe compiler put XMM use when compiled for kernel. I guess, we need to see two functions before and after the change in the assembly. That's easy to achieve. – 0andriy Mar 09 '20 at 09:30
  • It really would be nice if this could be reproduced in any way in a userland example. I suspect it might be the optimizer exploiting UB, but I can't spot the place. – Greg A. Woods Mar 10 '20 at 18:14
  • @GregA.Woods: It sounds from OP's answer like the code and the compiler are fine per se, but that it wasn't compiled with the correct options for kernel code. So it ended up generating code that would be just fine in userspace, but fails in the kernel since sse registers are not safe to use. – Nate Eldredge Mar 10 '20 at 19:18

1 Answers1

2

Thanks to Nate Eldredge for setting up the Compiler Explorer and 0andriy for the %xmm0 pointer. This does look like an issue with the compiler using register unsafe for kernel code (or some closely related issue). Experimenting with the Godbolt site I was able to find that the -mno-sse2 compiler flag has a similar effect to switching to gcc-6 in removing use of the $xmm0 register in that code. And, when added to the compiler flags in the actual application compilation it appears to solve the issue. Some more work is likely to be needed to get to the bottom of the right solution but we seem to have some good pointers now.

0andriy
  • 4,183
  • 1
  • 24
  • 37
bodgesoc
  • 191
  • 9
  • I couldn't tell from the code, but are you saying this code runs in a Linux kernel module? If you build your module [the right way](https://www.kernel.org/doc/Documentation/kbuild/modules.txt), it should automatically get all the correct compiler flags, including `-mno-sse2` and several more that will forestall other problems you will otherwise surely encounter in the future. – Nate Eldredge Mar 10 '20 at 18:59
  • The Makefile is complicated as, depending on the realtime system chosen, the same code can be compiled to run as a kernel module under RTAI, or as a userspace module under preempt-rt or Xenomai. (I think that kernel mode Xenomai and user-mode RTAI may also be possible). (I did mention kernel space and realtime in the original problem description) Thanks for the link to the kernel module building instructions, however I don't see anything helpful about compile flags there. Note: LinuxCNC has been compiling and running properly for decades (and many parts are mysterious to me) – bodgesoc Mar 10 '20 at 20:46
  • 1
    The point is that when you build with `make -C M=$PWD`, you pull in a Makefile written by the kernel maintainers that contains all the right compile flags for the kernel version in question. [Here](https://pastebin.com/YB7spBCy) you can see the result of running `make V=1` on a simple module, showing all the flags that got used for this particular kernel. – Nate Eldredge Mar 10 '20 at 20:51
  • 2
    So in short, it seems to me that the "right solution" is to fix your build system so that it uses all the compiler options that the kernel maintainers have determined are needed for a given kernel version (noting in particular that a compiler version newer than the corresponding kernel is not guaranteed to work). This could be done automatically by pulling in kernel Makefiles, or by hand (more error-prone). – Nate Eldredge Mar 10 '20 at 20:57
  • Thanks @NateEldredge. It turns out that RTAI kernel modules use a different set of cflags (provided by RTAI) as floating-point is specifically supported in RTAI kernel modules. I am in discussion with one of the RTAI developers and a more knowledgable LinuxCNC developer and we feel it is a stack alignment issue. We have found that setting both -mpreferred-stack-boundary=4 and -mincoming-stack-boundary=3 seems to work. And we are working on understanding if that's a terrible idea – bodgesoc Mar 12 '20 at 01:12