
I can't understand a specific scenario in which my multi-threaded C++ application (running on a Linux machine, Wind River 6.x) hits a segmentation fault.

I know the concept of a segmentation fault, and I went over this post and also this one, but I did not find a scenario similar to mine or an answer to my question, so I'm posting this question.

My code that generates the segmentation fault is as follows (abbreviated and simplified):

// MyStruct* pMyStruct is a function argument that, at some point in time,
// is set to NULL
ASSERT_PTR_NE(pMyStruct, NULL); // <--- this assertion is logged to my application log (meaning, at this line, pMyStruct is NULL)
int someInt = pMyStruct->someIntOfMyStruct; // <--- this line does NOT create the segmentation fault
double someDouble = pMyStruct->someDoubleOfMyStruct; // <--- this line ALSO does NOT create the segmentation fault
ASSERT_NUM_EQ(pMyStruct->someIntOfMyStruct, SOME_INT_VALUE_TO_CHECK); // <--- this line DOES create the segmentation fault

As noted on the last code line, the 4th line of code is (I guess) the last line that my application executes: when I examine the core file with GDB, frame 0 points to this line as the one that causes the crash.

My questions, then, are:

  1. How come the 2nd and 3rd lines of code did not cause a segmentation fault?

  2. What exactly happens, system-wise (in the OS and in the application), from the moment the NULL pointer is accessed (in the first line) until the OS terminates the application? In other words, is it possible that the segmentation fault was actually raised by the 1st line, yet lines 2-4 were still executed before the OS decided and acted to terminate the application, and on reaching the 4th line the application raised a segmentation fault "again"?

  3. Or is it possible that the pMyStruct variable was overwritten? That is, after the first line (the assert, which prints info to the application log), another thread set pMyStruct to a non-NULL value, "allowing" lines 2-3 to run without a crash, and then, just before line 4 executed, another thread set pMyStruct back to NULL, this time causing line 4 to crash?

MSalters
Guy Avraham
  • The mistake is assuming every bad pointer operation will cause a segfault. Sometimes you get unlucky and only memory that your process is coincidentally allowed to access gets accessed, which leads to unexpected behavior or corruption instead of a crash. When you get a segfault it's because your code does not respect the rules of C++ and has Undefined Behavior. At that point, you cannot reliably log, recover or otherwise reliably do anything with that error. – François Andrieux May 05 '22 at 13:23
  • Lines 2 and 3, unless you're disabling optimizations completely, might not actually exist in your executable. The reads could be re-ordered to where the values are actually used (like the 4th line). And a pointer doesn't have to be NULL to be invalid. Lastly, if there is any chance that another thread could modify that pointer anywhere during that piece of code's execution, you have undefined behavior already regardless of the validity of the pointer and need to fix that. – Mat May 05 '22 at 13:27
  • *When and how does exactly a segmentation of a C/C++ application is reported and handled by the OS?* For an OS that supports detecting SEGV, it happens as soon as a pointer tries to read or write memory that is not in the application's user mode space. The errant program is immediately terminated, and the failure reported to the user. The `nullptr` tripping a SEGV is because it is common for OS that support SEGV detection to mark address starting at 0 to some upper address (e.g., 64KB mark, or 1MB mark) as not being in the user mode space. – Eljay May 05 '22 at 14:27

1 Answer

Typically, an OS raises a segmentation fault after the CPU faults on an address. The CPU doesn't know why the fault happened. It might be that the memory is paged out to disk, but for this question we're assuming a bad pointer. The OS knows it's a bad pointer because the address doesn't correspond to any memory mapped into the process, paged out or otherwise. Hence, the OS tells the CPU it is handling the situation, and tells the CPU to continue execution in the signal handler. If the application has not installed a handler for the signal, the default action terminates the process, typically with a core dump.
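To make that sequence concrete, here is a minimal sketch, assuming a Linux/POSIX system. It is not taken from the question's code: the helper name `signalFromBadAccess` is hypothetical, and it touches a page deliberately mapped with `PROT_NONE` rather than a null pointer, so the faulting address is one the program obtained on purpose. The access runs in a child process so the parent can observe how the OS ended it.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical helper: performs a faulting read in a child process and
// returns the number of the signal that terminated the child.
// Returns 0 if the child was not killed by a signal, -1 on setup failure.
int signalFromBadAccess() {
    pid_t child = fork();
    if (child < 0) {
        return -1;
    }
    if (child == 0) {
        // Make sure the default action (terminate) applies in the child.
        signal(SIGSEGV, SIG_DFL);

        // Reserve one page with no access rights at all.
        std::size_t pageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        void* page = mmap(nullptr, pageSize, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) {
            _exit(2);
        }

        // The CPU faults on this read; the kernel finds no readable
        // mapping for the address and raises SIGSEGV in this process.
        volatile char* p = static_cast<volatile char*>(page);
        char c = *p;
        (void)c;
        _exit(0);  // Not reached when the signal is delivered.
    }

    // Parent: wait for the child and report how it ended.
    int status = 0;
    if (waitpid(child, &status, 0) != child) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a typical Linux system I would expect `signalFromBadAccess()` to return `SIGSEGV`: the child never gets past the faulting read, because the signal is delivered at that instruction and not at some later point.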

The C++ null pointer isn't special to the CPU. It just so happens that the OS, by convention, does not map any memory at this address.

By the C++ standard, your code has Undefined Behavior, and that allows "time travel". More accurately, to allow optimizations, compilers may shuffle code around on the assumption that Undefined Behavior does not happen. It would seem that lines 2 & 3 are shuffled after line 4. A correct C++ program cannot detect such reordering.
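One way to stay out of Undefined Behavior here is to stop executing when the pointer is null, instead of only logging the failed assertion and carrying on. The following is a minimal sketch, not the asker's actual code: the field names come from the question, while the struct layout and the helper name `readFields` are assumptions for illustration.

```cpp
// Assumed layout; only the two field names appear in the question.
struct MyStruct {
    int someIntOfMyStruct;
    double someDoubleOfMyStruct;
};

// Hypothetical helper: copies the fields out only when the pointer is
// non-null, so no code path dereferences a null pointer.
bool readFields(const MyStruct* pMyStruct, int& someInt, double& someDouble) {
    if (pMyStruct == nullptr) {
        return false;  // Stop here; the outputs are left untouched.
    }
    someInt = pMyStruct->someIntOfMyStruct;
    someDouble = pMyStruct->someDoubleOfMyStruct;
    return true;
}
```

With the early return, the compiler has no null dereference to reason about, so reordering the two reads cannot change what the program does. This does not address a pointer that another thread modifies concurrently; that needs proper synchronization.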

This is not how a typical CPU sees it. Modern CPUs also shuffle instructions around internally, like compilers do, but when the CPU reports the fault to the OS it will pretend that all instructions happened in the right order.

MSalters