
I've been trying to figure out the most efficient way for many threads to alter a very big byte array at the bit level. To make it easier to explain, I'll base the question around a multithreaded Sieve of Eratosthenes. The code shouldn't be expected to be fully complete, as I'll omit parts that aren't directly related, and the sieve won't be fully optimised since that isn't the question. The sieve works by recording which values are primes in a byte array, where each byte holds 7 numbers (we can't alter the first bit because bytes are signed).

Let's say our goal is to find all the primes below 1 000 000 000 (1 billion). As a result we would need a byte array of length 1 000 000 000 / 7 + 1, or 142 857 143 (about 143 million).

class Prime {
    int max = 1000000000;    
    byte[] b = new byte[(max/7)+1];

    Prime() {
        for(int i = 0; i < b.length; i++) {
            b[i] = (byte)127;  //Setting all values to 1 at start
        }
        findPrimes();
    }

    /*
     * Calling remove will set the bit value associated with the number
     * to 0, signaling that the number isn't a prime
     */
    void remove(int i) {
        int j = i/7; //gets which array index to access
        b[j] = (byte) (b[j] & ~(1 << (i%7)));
    }

    void findPrimes() {
        remove(1); //1 is not a prime, so remove it right from the start
        int prime = 2;
        while (prime*prime < max) {
            for(int i = prime*2; i < max; i = prime + i) {
                remove(i);
            }
            prime = nextPrime(prime); //This returns the next prime from the list
        }
    }

... //Omitting code, not relevant to question
}

Now we have a basic outline where something runs through all the numbers in a given multiplication table and calls remove to set the corresponding bits to 0 once we know those numbers aren't primes.

Now, to up the ante, we create threads that do the checking for us. We split the work so that each thread takes part of the removal from the table. For example, if we have 4 threads and are running through the multiplication table for the prime 2, we would assign thread 1 every number spaced 8 apart starting at 4, that is 4, 12, 20, ..., the second thread starts at 6 and goes through 6, 14, 22, ..., and so on. Each thread then calls remove on the numbers it is responsible for.
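Roughly, each worker would run something like this (just a sketch; removeMultiples, threadIndex and threadCount are illustrative names I'm making up, and remove is the method above):

    // Sketch of the stride-based split: each thread removes every threadCount-th
    // multiple of prime, starting at its own offset, so between them the threads
    // cover each multiple exactly once.
    void removeMultiples(int prime, int threadIndex, int threadCount) {
        int start = prime * (2 + threadIndex); // thread 0 starts at 2*prime, thread 1 at 3*prime, ...
        int step = prime * threadCount;        // stride, so no two threads pick the same multiple
        for (int i = start; i < max; i += step) {
            remove(i);
        }
    }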

Now to the real question. As most can see, while the prime is less than 7, multiple threads will be accessing the same array index. While running through 2, for example, thread 1, thread 2 and thread 3 will all try to access b[0] to alter the byte, which causes a race condition that we don't want.
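To spell it out, remove is a read-modify-write, so two threads can interleave like this (just an illustration of the lost update):

    // b[0] starts as 0b1111111; thread A wants to clear the bit for 4,
    // thread B the bit for 6.
    //
    //   A: reads  b[0]                    -> 0b1111111
    //   B: reads  b[0]                    -> 0b1111111
    //   A: writes 0b1111111 & ~(1 << 4)   -> 0b1101111
    //   B: writes 0b1111111 & ~(1 << 6)   -> 0b0111111   (A's removal is lost)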

The question therefore is: what is the best way of optimising access to the byte array?

So far the thoughts I've had for it are:

  1. Putting synchronized on the remove method. This would obviously be very easy to implement, but it's a horrible idea as it would remove any gain from having threads.
  2. Creating a mutex array equal in size to the byte array. To write to an index, a thread would need the mutex for that same index (see the sketch after this list). This would be fairly fast, but it requires another very big array in memory, which might not be the best way to do it.
  3. Limiting the numbers stored in each byte to the prime we start running on. So if we start on 2 we would store 2 numbers per byte. This would however increase our array length to 500 000 000 (500 million).
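For option 2 I'm picturing something along these lines (just a sketch; the locks array and the names here are mine for illustration):

    Object[] locks = new Object[b.length];   // one lock object per byte of the sieve
    // (each entry filled with new Object() in the constructor)

    void remove(int i) {
        int j = i/7;                  // same index math as before
        synchronized (locks[j]) {     // only one thread may touch b[j] at a time
            b[j] = (byte) (b[j] & ~(1 << (i%7)));
        }
    }

A smaller, fixed-size lock array indexed by j % locks.length (lock striping) would cut the memory cost at the price of some extra contention.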

Are there other ways of doing this in a fast and optimal way without using too much memory?

(This is my first question here, so I tried to be as detailed and thorough as possible, but I would welcome any comments on how I can improve the question - too much detail, needs more detail, etc.)

Asthor
  • What is the problem with the race condition? If 1 or 10 threads are changing the array position to zero, it doesn't matter, right? – rdllopes Mar 06 '14 at 16:15
  • The issue is how we bit-shift. We are reading b[i] and altering it, which means that thread 1 might read it, thread 2 might read it, and then thread 1 alters it followed by thread 2 altering it; thread 2 would of course write the result of its alteration based on what it read, which means the alteration from thread 1 would get lost. Maybe that's a solution though: can we do the changing of bits in an atomic way? – Asthor Mar 06 '14 at 16:21
  • It is not possible. =( You could try to isolate threads by range. – rdllopes Mar 06 '14 at 16:38
  • Just a question: I suppose the data is persisted by the threads. If not, why use several threads? – C.Champagne Mar 06 '14 at 17:15
  • @C.Champagne Might be a language barrier, so I'm not sure I fully understood the question. But yes, the threads would be run inside an inner class with the data saved outside for use after the threads are done. – Asthor Mar 06 '14 at 18:04
  • @Asthor Maybe I didn't plainly understand what you intended to do... My concern was that splitting a task into several threads can improve performance if there are bottlenecks (such as writing to a disk to persist data). If there is no bottleneck, there is a good chance that the processing time gets longer. – C.Champagne Mar 07 '14 at 17:42
  • @C.Champagne Ah, now I understand. Well, no, the data isn't persisted in that sense, not in the context of the question at least. It of course applies to the question that we limit how many threads we have based on availableProcessors (and whether there is Intel hyper-threading) to achieve any speedup. The goal of the idea above is to maximize the speed of saving bits to an array with many threads accessing the same array index, with the data only kept outside once the run is done. (No real practical use, just theoretical playing around with efficiency.) – Asthor Mar 07 '14 at 18:02

2 Answers


You can use an array of atomic integers for this. Unfortunately there isn't a getAndAND, which would be ideal for your remove() function, but you can CAS in a loop:

java.util.concurrent.atomic.AtomicIntegerArray aia;

....

void remove(int i) {
    int j = i/32; //gets which array index to access
    boolean updated;
    do {
        int oldVal = aia.get(j);
        int newVal = oldVal & ~(1 << (i%32));
        updated = aia.weakCompareAndSet(j, oldVal, newVal);
    } while (!updated);
}

Basically you keep trying to adjust the slot to remove that bit, but you only succeed if nobody else modifies it out from under you. Safe, and likely to be very efficient. weakCompareAndSet is basically an abstracted Load-link/Store conditional instruction.

BTW, there's no reason not to use the sign bit.
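For completeness, the array itself would be sized for 32 numbers per int and have every bit set up front, something like this (sketch only, mirroring the byte version in the question):

    // Sketch: 32 numbers per int, sign bit included, all bits set to 1 at the start.
    java.util.concurrent.atomic.AtomicIntegerArray aia =
        new java.util.concurrent.atomic.AtomicIntegerArray(max/32 + 1);
    for (int j = 0; j < aia.length(); j++) {
        aia.set(j, -1); // -1 has all 32 bits set
    }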

Sneftel
  • Why use `weakCompareAndSet(...)`? To quote the javadocs: May fail spuriously and does not provide ordering guarantees, so is only rarely an appropriate alternative to `compareAndSet(...)`. – Gray Mar 06 '14 at 16:33
  • @Gray And this is exactly one of those situations. Spurious failures are A-OK, because we can just loop and try it again. Guaranteeing no spurious failures is not worth any extra time, because we don't rely on such a guarantee. – Sneftel Mar 06 '14 at 16:51
  • I'm really not sure you want to use `weak...` unless you know specifically what it does. It seems like the difference between it and the "strong" version is very OS dependent. See here: http://stackoverflow.com/questions/2443239/java-atomicinteger-what-are-the-differences-between-compareandset-and-weakcompa – Gray Mar 06 '14 at 16:54
  • @Sneftel: what about the happens-before relationship that isn't guaranteed with the weak version? A thread reading the array may see inconsistent values in it, due to thread visibility problems. – Eyal Schneider Mar 06 '14 at 16:55
  • Yeah, I had been looking at AtomicIntegerArrays but somehow my head got stuck on the idea that I couldn't use anything bigger than a byte array (don't ask me why; sometimes focusing too much on the problem makes one unable to see it in a bigger context). But this would do exactly what we need for minimizing memory usage and achieving the speed we want. You are also right about the sign bit, as there is no reason not to mess with it here. I basically did not alter it out of habit, as messing with the sign bit can sometimes be tricky. – Asthor Mar 06 '14 at 16:57
  • @EyalSchneider That's not a problem either; the function here doesn't rely on any particular ordering. – Sneftel Mar 06 '14 at 17:30
  • @Sneftel: remember that once the code completes, some thread has to go over the array and check the set bits. How can it guarantee seeing the most up-to-date values if happens-before isn't established between the writes and the reads? The Java Memory Model spec becomes tricky once you want to prove correctness of code with data races. – Eyal Schneider Mar 06 '14 at 17:41
  • @EyalSchneider once the code completes, all the worker threads have ended. At that point, a full barrier can be performed. There's the potential for a work-wasting race condition depending on how `nextPrime` is implemented, but the additional ordering provided by the strong CAS isn't of much help in resolving it anyway. – Sneftel Mar 06 '14 at 17:44
  • `join` is exactly how the code knows the worker threads have ended. `weakCompareAndSet` is the right tool for the job here. The operation is commutative and idempotent, and there is no requirement that the setting of different flags have any particular ordering. – Sneftel Mar 06 '14 at 18:57
  • @Gray it's worth noting that the "weakness" of `weakCompareAndSet`'s ordering is entirely with respect to *other atomic variables*. Operations on the same atomic (which is all a call to `remove` cares about) are coherently ordered. – Sneftel Mar 06 '14 at 19:01
  • Ok, I learned something here. If you are already in a while loop, you might as well use `weakCompareAndSet(...)` because it will retry anyway. Is that your understanding @Sneftel? – Gray Mar 06 '14 at 19:17
  • @Gray Pretty much, yeah... though of course it depends on what *else* the while loop is doing. You can't use this to emulate CAS2, for instance. – Sneftel Mar 06 '14 at 19:33
  • @Gray on some architectures, the equivalent of `weakCompareAndSet` is the *only* atomic primitive available, and other primitives (CAS, atomic increment, etc.) are implemented using it. – Sneftel Mar 06 '14 at 19:35
  • Yeah, and on others `weakCompareAndSet` is implemented the same as `compareAndSet` -- i.e. the weak version never fails spuriously. Looks like it depends highly on the architecture @Sneftel. – Gray Mar 06 '14 at 19:38
  • @Sneftel: You are right. From javadoc: "weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides no guarantees with respect to previous or subsequent reads and writes of *any variables other than the target of the weakCompareAndSet*. " – Eyal Schneider Mar 06 '14 at 20:57

I think you could avoid synchronizing threads...

For example, this task:

for (int i = prime*2; i < max; i = prime + i) {
    remove(i);
}

could be partitioned into smaller tasks:

for (int i = 0; i < threadPool; i++) {
    int totalPos = max / 8;                    // dividing the virtual array into bytes
    int partitionSize = totalPos / threadPool; // dividing those bytes among the thread pool
    removeAll(prime, partitionSize * i * 8, (i + 1) * partitionSize * 8);
}
....

// no collisions!!!

void removeAll(int prime, int initial, int max) {
    int k = initial / prime;
    if (k < 2) k = 2;
    for (int i = k * prime; i < max; i = i + prime) {
        remove(i);
    }
}
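To actually run the partitions you could do something like this (sketch only; this thread wiring is mine, not tested against your code):

    // Sketch: one worker thread per partition; each thread only touches its own
    // range of the array, and join() makes all the removals visible afterwards.
    Thread[] workers = new Thread[threadPool];
    int partitionSize = (max / 8) / threadPool;
    final int p = prime;                       // capture for the anonymous Runnable
    for (int i = 0; i < threadPool; i++) {
        final int from = partitionSize * i * 8;
        final int to = (i + 1) * partitionSize * 8;
        workers[i] = new Thread(new Runnable() {
            public void run() { removeAll(p, from, to); }
        });
        workers[i].start();
    }
    for (Thread w : workers) {
        w.join(); // throws InterruptedException; handle or declare it as needed
    }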
rdllopes
  • This might work, yeah; I would have to include a check so that the split doesn't fall inside the same byte array index. – Asthor Mar 06 '14 at 18:13
  • I tested implementing this yesterday, running through a very big array (2 billion numbers) with an optimised sieve (no even numbers etc.), and it showed a speedup with the threads of about 1.5 (I can optimise it even further with better logic on the starting number for each thread in the remove loop, so we could get a better speedup). The atomic array has a speedup of about 3, though. +1 on the idea, I liked it. – Asthor Mar 07 '14 at 14:03
  • Unfortunately, I don't have the whole code here to test by myself. But take a look at this RecursiveAction: http://www.omsn.de/blog/how-to-parallelize-loops-with-java-7-fork-join-framework. In C, I would suggest trying OpenMP. – rdllopes Mar 07 '14 at 14:16