
This is my second post. My first post was as an undergrad in an Intro to C class. I hope I do better this time.

I am developing a stochastic model for my research and am looking for ways to increase the computational efficiency of my algorithms. I was recently introduced to multithreading by a colleague, and I have successfully applied it to a few of my algorithms that did not require locks...

Now I am trying to modify/thread an algorithm that will require locks. In my literature/StackOverflow/google searches, I have learned quite a lot about the mutex class. However, the problem I am facing seems to be unique. For context, the closest thing I could find to an answer was here:

What happens when two threads ATTEMPT to lock the same resource at the exact same time?

It still didn't directly address the question I am about to ask.

I have been attempting to control memory access between threads using a dynamically allocated array of pointers to mutex objects. My code compiles without warnings or errors (compiled with the flags -pthread -std=c++11) and executes perfectly up until this threaded algorithm, where a segfault occurs. Investigating with lldb, I found that the exception being thrown is as follows:

[lldb output screenshot]

Upon further investigation, I found that the threads being referenced by the lldb output are all attempting a try_lock() on the EXACT SAME mutex object since the pointer array containing the address of that object was passed to each thread.

My question, which is similar to the previously mentioned post is:

What happens when a try_lock() is attempted by multiple threads (more than one) on the same mutex at the EXACT SAME time (the same tick of the processor clock)? Do newer variations of the mutex class have workarounds for this seemingly catastrophic event (e.g. shared_mutex, timed_mutex, etc.)?
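To make the scenario concrete, here is a minimal, self-contained sketch, separate from my research code (the function name is just illustrative), of several threads racing to try_lock() the same mutex:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Spawn n threads that all race to try_lock() the same mutex and
// return how many of them succeeded. The winner holds the lock until
// every thread has been joined, so at most one attempt can succeed.
int count_try_lock_winners(int n)
{
    std::mutex m;
    std::atomic<int> winners{0};

    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&] {
            if (m.try_lock())
                ++winners;  // the winner deliberately does NOT unlock yet
        });
    }
    for (auto &t : threads)
        t.join();

    m.unlock();  // release the lock held by the single winner
    return winners;
}
```

Is it guaranteed that this always returns 1, no matter how tightly the attempts coincide?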

This has recently been blowing my mind, so any insight would be greatly appreciated. Big shoutout to the S/O community for all the help, on this post and on all others that have been invaluable to my growth as a programmer.

LINK TO CODE:

https://github.com/tylerbalbright/StackOverflow_7_4.git

The error occurs in RVEdata.cpp, at either line 751 or line 857.

FIXED BUT NOT SOLVED:

I was able to fix my code using a deque of mutex objects as opposed to creating a vector of pointers to dynamically created mutexes. The solution had been proposed by other users here:

How can I use something like std::vector<std::mutex>?
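Concretely, the working version replaces the pointer gymnastics with something like this (variable names are illustrative):

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// std::mutex is neither copyable nor movable, so a std::vector of
// mutexes cannot be resized after construction; std::deque can grow
// via emplace_back because it never relocates its existing elements.
std::deque<std::mutex> PosLocks;

void MakeLocks(std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        PosLocks.emplace_back();  // construct each mutex in place
}
```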

In my first (unsuccessful) trial, I was attempting to create an array of pointers to mutexes like this:

long int N = RuntimeVector.size(); //Varying size at runtime
std::mutex *MutexPtrs;
MutexPtrs = new std::mutex[N];

I would then pass the newly created array to a function as a pointer, and that function would in turn pass the pointer along to a new thread, like this:

void SomeFunction(std::mutex *PosLocks[])
{
    // ...

    SearchersPos.push_back(std::async(std::launch::async, &RVEdata::PositiveSearch, this, PosLocks));
}
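For context, a compilable stand-in for this pattern (SomeFunction, SearchersPos, and RVEdata::PositiveSearch are from my code; the worker body below is a simplified placeholder) looks like:

```cpp
#include <cstddef>
#include <future>
#include <mutex>
#include <vector>

// Simplified stand-in for the real search routine: each task try-locks
// every slot in the shared array, counting the locks it could take.
int SearchStandIn(std::mutex *locks, std::size_t n)
{
    int acquired = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (locks[i].try_lock()) {
            ++acquired;
            locks[i].unlock();
        }
    }
    return acquired;
}

int RunSearchers(std::size_t n, int nthreads)
{
    std::mutex *locks = new std::mutex[n];  // shared by every task

    std::vector<std::future<int>> searchers;
    for (int t = 0; t < nthreads; ++t)
        searchers.push_back(
            std::async(std::launch::async, SearchStandIn, locks, n));

    int total = 0;
    for (auto &f : searchers)
        total += f.get();  // join all tasks before freeing the array

    delete[] locks;
    return total;
}
```

With a single task there is no contention, so every try_lock succeeds; with several tasks the count varies from run to run.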

Using this method, the code didn't fail on EVERY execution, but it failed roughly 90% of the time with EXC_BAD_ACCESS. Interestingly, on every failed execution the bad access was thrown while multiple threads were attempting to try_lock the same mutex; it NEVER failed with just one thread attempting a try_lock on a lone mutex. When I ran the same code on our HPC, the failures occurred roughly 95% of the time, but I couldn't extract as much information from gdb as from lldb (I am less proficient with command-line gdb).

PS - I am operating on macOS High Sierra 10.13.6, Apple LLVM version 10.0.0 (clang-1000.11.45.5).

  • I'm pretty sure that locking (and I'm speaking from vague memory here) is done via the operating system as an atomic operation, which means there will be no overlapping. I'd be glad if someone could correct me on that, though. – vlind Jul 04 '19 at 18:51
  • This should work just fine; that's the point of mutexes. We need code which shows this going wrong to understand where the bug is. – Mike Vine Jul 04 '19 at 18:53
  • The whole point is that only **ONE** thread will succeed. The mutex is explicitly defined for this exact use case. The implementation will use the appropriate (usually hardware) mechanism for the platform to guarantee that only one thread acquires the lock. – Martin York Jul 04 '19 at 18:57
  • Note: **BAD_ACCESS** usually means a memory address that is not allocated to your application by the OS. This means you probably have a bad pointer somewhere. – Martin York Jul 04 '19 at 19:01
  • What will happen is that both memory accesses will go into their respective processors' queues. From that point on, the MESI/MOESI/MESIF cache-coherence logic (including bus snooping) will decide, in a somewhat arbitrary, hardware-dependent fashion, a "winner". That winner is allowed to perform its access first and grabs the lock first. The other "loses", and the cache-coherency protocol makes it appear to it that the lock has been taken. – Iwillnotexist Idonotexist Jul 04 '19 at 19:03
  • One thread will succeed; another thread will not succeed. Your segfault has nothing to do with this and is because of some other bug in your code. This is not a "catastrophic event" of any kind, and is precisely what mutexes are designed to handle, by definition. That's their job, to do this kind of thing. – Sam Varshavchik Jul 04 '19 at 19:03
  • If two threads trying to acquire a mutex at the same time crashes, then either your standard library implementation sucks (this would be a very serious defect) or the problem is elsewhere. – François Andrieux Jul 04 '19 at 19:24
  • From these comments, and my readings of the atomicity of the try_lock() function, I am confident my error exists elsewhere in my code. The entire source code is quite large, so I have uploaded it as a repository with the link included in the updated question. I am going to continue digging. Thanks for the timely replies! – Tyler Albright Jul 04 '19 at 19:27
  • Try compiling with `-fsanitize=address`, and if that doesn't show any problems, try `-fsanitize=undefined` instead. I bet you're simply accessing past the end of the mutex array. – Jonathan Wakely Jul 05 '19 at 15:13
  • Why do you think you need a `deque` instead of a `vector`? The answer you linked to explains that you can't **resize** a `vector`, but that doesn't stop you from just constructing it with the right size, which is fine for your case, because you don't need to resize it. Just do `std::vector<std::mutex> poslocks(N);`. – Jonathan Wakely Jul 05 '19 at 15:19
  • Your code seems to split the search party into multiple threads, and then have every thread run exactly the same search: `SearchersPos.push_back(std::async(std::launch::async, &RVEdata::PositiveSearch, this, PosLocks));` Won't they just be doing exactly the same work, and therefore wasting resources? – Jonathan Wakely Jul 05 '19 at 15:21
  • The threads work in a sort of leap-frog manner. The boolean vectors drive the action, which typically flows from low index to high index. If I assign threads to a chunk of the boolean vector (e.g. thread #1 takes indices 1 through 100, thread #2 takes 101 through 200, and so on), then it turns out that each thread pretty much waits for the completion of the thread created just before it: almost a sequential operation. For my PosLocks and GndLocks variables, I was originally declaring them as class member variables, which restricted my ability to set their size at program start. – Tyler Albright Jul 05 '19 at 15:56

2 Answers


When more than one thread does a try lock at the exact same time, one of them ends up with the lock and the others fail.

Your bug, like the princess, is in another castle: it is caused by something else.

Yakk - Adam Nevraumont

What happens when a try_lock() is attempted by multiple threads (more than one) on the same mutex at the EXACT SAME time (the same tick of the processor clock)? Do newer variations of the mutex class have workarounds for this seemingly catastrophic event (e.g. shared_mutex, timed_mutex, etc.)?

There is absolutely nothing catastrophic about it; this is what mutexes are for. You have not found a flaw in how mutexes work: "select" is not broken. You should read The First Rule of Programming by one of Stack Overflow's founders.

In my first (unsuccessful) trial, I was attempting to create an array of pointers to mutexes like this:

The problem is that you're creating two arrays of mutexes:

std::mutex *PosLocks;
std::mutex *GndLocks;

PosLocks = new std::mutex[N];
GndLocks = new std::mutex[N];

But then you're passing the addresses of the arrays:

NetExists = FindNetwork(N, &PosLocks, &GndLocks);

which calls this function:

bool RVEdata::FindNetwork(long int N, std::mutex *PosLocks[], std::mutex *GndLocks[])

So now you have passed std::mutex** arguments to the function, which could be either a pointer to an array of mutexes or an array of pointers to mutexes.

And then you're accessing them as an array of mutex* instead of a pointer to mutex[]:

    if (PosLocks[i]->try_lock() == true)

PosLocks[i] indexes into an array to obtain a mutex* and then uses the -> operator to dereference a pointer. But that's backwards! You do not have an array of mutex*!

As I said in my comment above, you're simply accessing the array incorrectly. This means the program tries to lock a mutex object at some address where there is no mutex object, because you're reading an address which is somewhere in the middle of a mutex object, not at its start.

The line above should dereference the pointer first, to get to the mutex[], and then index into the array:

    if ((*PosLocks)[i].try_lock() == true)
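A small self-contained check (using a throwaway array; `corrected_indexing_works` is not part of the question's code) confirms that the dereference-then-index form reaches real mutex objects:

```cpp
#include <cstddef>
#include <mutex>

// Build the same shape as the question's code -- a std::mutex** that
// points at a single dynamically allocated array -- and verify that
// (*locks)[i], unlike locks[i]->..., addresses valid mutexes.
bool corrected_indexing_works(std::size_t n)
{
    std::mutex *arr = new std::mutex[n];
    std::mutex **locks = &arr;

    bool ok = true;
    for (std::size_t i = 0; i < n; ++i) {
        if ((*locks)[i].try_lock())   // dereference first, then index
            (*locks)[i].unlock();
        else
            ok = false;
        // locks[i]->try_lock() would instead read pointer-sized slots
        // starting at &arr, which is garbage for every i > 0.
    }
    delete[] arr;
    return ok;
}
```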

Or even better would be to just stop passing a pointer to the array in the first place, i.e. declare the function as:

bool RVEdata::FindNetwork(long int N, std::mutex PosLocks[], std::mutex GndLocks[])

and call it as:

NetExists = FindNetwork(N, PosLocks, GndLocks);

Now you can just access the array directly:

    if (PosLocks[i].try_lock() == true)

Even better, stop using dynamically created arrays:

std::vector<std::mutex> PosLocks;
std::vector<std::mutex> GndLocks;
// ...
NetExists = FindNetwork(N, PosLocks, GndLocks);
// ...

bool RVEdata::FindNetwork(long int N, std::vector<std::mutex>& PosLocks, std::vector<std::mutex>& GndLocks)
{
    // ...
    if (PosLocks[i].try_lock() == true)

The problem would not have happened if you hadn't been mixing arrays, pointers, and dynamic allocation.

Jonathan Wakely
  • Thank you for the in-depth analysis. It makes perfect sense now why my code was behaving the way it was. As for the last segment of your answer: using vectors to store the mutex objects was my first attempt (it wouldn't compile), since mutex objects are not copyable by default. STL containers have been both my Excalibur and the bane of my existence; since being introduced to them, I seem to frequently misuse bare-bones dynamic allocations. – Tyler Albright Jul 05 '19 at 15:47
  • So don't copy them. Pass the `vector` by reference, then it isn't copied. The problem was not creating a `vector`, it was what you tried to do with it afterwards. – Jonathan Wakely Jul 05 '19 at 15:49