
Let arr be declared as follows.

std::vector<std::vector<int>> arr(10'000'000);

My serial code is something like this:

for (const auto [x, y] : XY) {
  arr[x].push_back(y);
}

I use OpenMP and define an array of locks as follows.

std::vector<omp_lock_t> locks(10'000'000);
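Each lock is initialized once with omp_init_lock before the parallel loop (a minimal sketch of that initialization pass):

// One-time, serial initialization of every lock
for (auto& l : locks)
  omp_init_lock(&l);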

With the locks, the loop becomes

#pragma omp parallel for schedule(dynamic)
for (const auto [x, y] : XY) {
  omp_set_lock(&locks[x]);
  arr[x].push_back(y);
  omp_unset_lock(&locks[x]);
}    

This approach works on my machine (Windows Subsystem for Linux). But then I found the following post:

c/c++ maximum number of mutexes allowed in Linux

which raises the concern that I may be using too many locks and that my program may not work on other platforms (those with a limit on the number of locks allowed).

I wonder whether there is another approach that keeps the same locking granularity as above but has no upper limit on the number of locks. Could I use something like compare-and-swap?

Sevaro
  • It really depends on what you want to do. In the simplest scenario, you may just use `std::atomic_int`. – Jodocus May 13 '21 at 08:55
  • What do you do during the "// update a[i]"? – dreamcrash May 13 '21 at 08:56
  • Keep it simple. What don't you like about having only one lock? – Louis Go May 13 '21 at 08:57
  • Why don't you use [std::mutex](https://en.cppreference.com/w/cpp/thread/mutex) and [std::condition_var](https://en.cppreference.com/w/cpp/thread/condition_variable) ? You could decide to have a mutex for a slice of e.g. 1024 elements. On Linux, see [futex(2)](https://man7.org/linux/man-pages/man2/futex.2.html) and the source code of [GNU libc](https://www.gnu.org/software/libc/) or of [MUSL libc](http://musl.libc.org/) – Basile Starynkevitch May 13 '21 at 09:00
  • What you do in `//update a[i]` is crucial for this question. For me it seems like a serious overkill to lock every int separately. Is it even possible that 2 threads will access the same value at the same time? – Yksisarvinen May 13 '21 at 09:04
  • Yes, two threads running on different cores of a multicore processor can access the same value at the same time. Read about [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence) – Basile Starynkevitch May 13 '21 at 09:06
  • @BasileStarynkevitch I was thinking strictly about C++. If the update is just something like `a[i]++;`, then locking is not necessary, you can parallelize by splitting the data into chunks and having each thread only deal with its own assigned part of the memory. – Yksisarvinen May 13 '21 at 09:12
  • Try to create (and use) a spinlock using `std::atomic` like described in https://rigtorp.se/spinlock/ – Ralequi May 13 '21 at 09:13
  • @dreamcrash Thanks. I have updated the problem. I cannot share the complete code since it is private. – Sevaro May 13 '21 at 09:17
  • Does `XY` contain duplicate `x` values? Is the order of `y` per each `x` important? Can `XY` be sorted? – Yksisarvinen May 13 '21 at 09:44
  • @Jodocus `omp_lock_t` seems to be effectively an atomic int. – Daniel Langr May 13 '21 at 10:17
  • @DanielLangr Yes, but it's not the lock that's about to be manipulated, but the `arr` vector (containing vectors). – Jodocus May 13 '21 at 10:34
  • @Jodocus Sorry, I kind-of do not understand. Are you suggesting using `atomic_int` for vector elements? First, it's not even possible, though this problem can be solved with `std::atomic_ref` in C++20, or `#pragma omp atomic` in OpenMP. However, what about vector reallocations caused by `push_back`? – Daniel Langr May 13 '21 at 10:44
  • @DanielLangr No, my comment was referring to the original, unedited post. In this context (which was not clear at this point), my comment was a possible suggestion. – Jodocus May 13 '21 at 11:07

2 Answers


According to the OpenMP documentation, omp_lock_t represents "a simple lock". I suppose it is some kind of spinlock, so you don't need to worry about limits on the number of mutexes. (Mutexes require some interaction with the kernel/scheduler, which is likely the reason their count is restricted.)

omp_lock_t uses a single memory location, which is fine for a spinlock. For instance, this live demo shows that omp_lock_t takes only 4 bytes, while std::mutex takes 40 bytes: https://godbolt.org/z/sr845h8hx.
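(In essence, the linked demo just compares the two sizes:)

    #include <omp.h>
    #include <mutex>
    #include <cstdio>

    int main() {
        // On the linked godbolt setup this prints 4 and 40.
        std::printf("%zu %zu\n", sizeof(omp_lock_t), sizeof(std::mutex));
    }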

Note that a spinlock can be implemented with a single byte, or even a single bit (BTS on x86_64). You could therefore pack your locks even more tightly in memory. However, I'd be careful with such an approach, since it can cause significant false-sharing issues.

I think your solution is fine. The operation in the critical section should be very fast, and there is likely a low probability that multiple threads will access the same element at the same moment, so a spinlock is a suitable solution in my opinion.

EDIT

As pointed out in the comments, the term simple lock does not necessarily mean that the lock is actually a spinlock; it is up to the OpenMP implementation to decide which type of lock it uses. If you want to be sure that you are using a true spinlock (which I still believe is suitable here), you can either write your own or use a library that provides one. (Note that an efficient spinlock implementation is not that simple. For instance, it needs to spin on a load/read operation rather than on an exchange/test-and-set, which naive implementations often do.)
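For illustration, here is a minimal, untested sketch of such a spinlock (test-and-test-and-set: it spins on a plain load and only retries the atomic exchange once the lock looks free); the names are illustrative and not taken from any particular library:

    #include <atomic>
    #include <vector>

    struct SpinLock {
        std::atomic<bool> locked{false};

        void lock() {
            for (;;) {
                // Try to grab the lock with a single atomic exchange.
                if (!locked.exchange(true, std::memory_order_acquire))
                    return;
                // Contended: spin on a read-only load so the cache line
                // is not repeatedly invalidated by writes.
                while (locked.load(std::memory_order_relaxed)) {
                    // optionally: a CPU pause/yield hint here
                }
            }
        }

        void unlock() { locked.store(false, std::memory_order_release); }
    };

    // Drop-in replacement for the lock array from the question (one byte
    // per lock, so mind the false-sharing caveat above):
    std::vector<SpinLock> spinlocks(10'000'000);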

Daniel Langr
  • The implementation of an OpenMP lock is entirely up to the system. "Simple lock" in the standard merely means that it is not a recursive (counting) lock, or a reader-writer lock. You cannot assume that an `omp_lock_t` is not implemented with an underlying system lock. The size of an omp_lock_t is also unspecified. If you switch to LLVM you'll see that `sizeof(omp_lock_t) == sizeof(void *)` so a pointer may be stored in it which can point to an arbitrary amount of storage. The "single memory location" might also be an index into a table of locks... – Jim Cownie May 14 '21 at 08:16
  • @JimCownie You are right, and I updated the answer accordingly. I even checked the LLVM implementation, and it seems that there is some conditional compilation where the type of locks can be chosen. – Daniel Langr May 14 '21 at 11:19
  • Not only at compile time. Check out the KMP_LOCK_KIND environment variable. – Jim Cownie May 15 '21 at 14:11

A couple of possible solutions:

  1. If you are on a modern Intel processor that supports Transactional Synchronization Extensions (TSX), you could use a single, speculative lock (see [Two Small OpenMP Features You May Have Missed][1], which shows a very similar use case; a rough sketch is given at the end of this answer).

  2. You could use a smaller array of locks and a hash to map the array index into the lock set. Something like this (untested and uncompiled!):

    // Allocate and initialise the lock array, wherever you have that!
    enum { ln2NumLocks = 10,
           NumLocks = 1 << ln2NumLocks };
    omp_lock_t locks[NumLocks];
    for (int i = 0; i < NumLocks; i++)
        omp_init_lock(&locks[i]);

    // Hash from array index to lock index. What you have here
    // probably doesn't matter too much. It doesn't need to be
    // a crypto-hash!
    int mapToLockIdx(int arrayIdx) {
        return ((arrayIdx >> ln2NumLocks) ^ arrayIdx) & (NumLocks - 1);
    }

    // Your loop is then something like this
    #pragma omp parallel for schedule(dynamic)
    for (const auto [x, y] : XY) {
        auto lock = &locks[mapToLockIdx(x)];
        omp_set_lock(lock);
        arr[x].push_back(y);
        omp_unset_lock(lock);
    }
    


  [1]: https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Cownie.pdf
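For option 1, a rough, untested sketch using the OpenMP lock-hint API (omp_init_lock_with_hint, available since OpenMP 4.5); whether the lock is actually elided via TSX is up to the runtime and the hardware:

    #include <omp.h>
    // arr and XY as in the question

    omp_lock_t specLock;
    omp_init_lock_with_hint(&specLock, omp_lock_hint_speculative);

    #pragma omp parallel for schedule(dynamic)
    for (const auto [x, y] : XY) {
        omp_set_lock(&specLock);      // typically elided while speculation succeeds
        arr[x].push_back(y);
        omp_unset_lock(&specLock);
    }

    omp_destroy_lock(&specLock);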
Jim Cownie