29

I have a program that is making a huge number of calls to Long.bitCount(), so many that it is taking 33% of cycles on one CPU core. Is there a way to implement it that is faster than the Sun JDK version?

I have tried:

  • This algorithm (I think this is exactly how the JDK implements it)
  • lookup tables of various sizes between 2^8 and 2^22 entries (looking at a few bits at a time and adding the results)

But I couldn't do any better than a 2^16-entry lookup table with a manually-unrolled loop (about 27% CPU.)
How else might this be optimized for Java?


Note: this question is about Java-specific optimization, but this similar (language-agnostic) question has many other algorithms.
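For reference, the 2^16-entry table approach described above can be sketched like this (a minimal version; the class and method names are illustrative, not from the question):

```java
// Sketch of a 2^16-entry lookup-table popcount: count 16 bits at a time
// and add the four partial results. Names here are illustrative.
public final class PopCount16 {
    private static final byte[] TABLE = new byte[1 << 16];
    static {
        // TABLE[i] = number of set bits in i, built incrementally
        for (int i = 1; i < TABLE.length; i++) {
            TABLE[i] = (byte) (TABLE[i >> 1] + (i & 1));
        }
    }

    public static int bitCount(long x) {
        return TABLE[(int) (x & 0xFFFF)]
             + TABLE[(int) ((x >>> 16) & 0xFFFF)]
             + TABLE[(int) ((x >>> 32) & 0xFFFF)]
             + TABLE[(int) (x >>> 48)];
    }
}
```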

finnw
  • Out of curiosity, what is it that you're doing that's making this many calls to that function? – templatetypedef Jan 29 '11 at 20:08
  • +1: Hurrah! A question about optimization where the questioner has actually profiled. – Oliver Charlesworth Jan 29 '11 at 20:10
  • @templatetypedef, it's part of an image recognition function. – finnw Jan 29 '11 at 20:11
  • Unfortunately, if there were a faster way, the JDK would probably already have adopted it. – Oliver Charlesworth Jan 29 '11 at 20:23
  • @finnw, can you show some code? "Profiling bitcount" would suck big time, since the profiling itself will take way more time than the execution of the method – bestsss Jan 29 '11 at 22:49
  • @Oli Charlesworth, I don't feel excited profiling a bit-twiddling function; also, if the CPU supports the function itself, the VM would use an intrinsic instead of the bit-twiddling. – bestsss Jan 29 '11 at 22:51
  • @finnw, exactly; profiling an inner-loop function that takes like 30 CPU clocks leads to overhead of like 7-8 times (if they use System.nanoTime() + loop/CAS for execution count/time – calculating the time and updating the memory is a lot more expensive than the func). Btw, you are running on 64 bits, right? I seriously doubt that's the function that causes the problem. Also for the larger table scan, if you happen to get a cache miss, it'd be quite an expensive operation. Also most likely w/ hprof the function cannot be inlined. – bestsss Jan 30 '11 at 03:56
  • @bestsss, yes the function can be inlined with hprof active (in that case the timings are merged into those of the calling method.) – finnw Feb 01 '11 at 18:39
  • I don't think that there is much room for optimization in the current implementation. You might get better results if you look at the bigger picture. Maybe there are some range restrictions in place that would allow better optimization, or maybe the image has a limited color palette and you could cache values just for it. Maybe you can apply some dynamic programming, use a more efficient algorithm, or use some approximation methods to avoid so many calls. I don't know about the others, but I'm interested to see more code (unless it's 100 lines long :D) – Dan Manastireanu May 08 '11 at 11:11
  • @finnw: could you post a histogram of the distribution of the bit counts from those 64-bit samples? That is, `bin[0]` is the relative frequency that a random 64-bit sample has no bits set, `bin[64]` is the relative frequency that all bits are set, and so on. Alternatively, if you can post the distribution of the bits (such as Bernoulli) it will also help. – rwong May 08 '11 at 12:26
  • @finnw: to help us determine whether profiling or language overhead is overshadowing the timing result, please try implementing the bit-counting code in a lower-level language such as C, C++ or Assembly, perform the timing test with 1 billion elements on the same machine, and post the timing results of **both** the Java and the native implementation. (I understand that your final implementation still needs to be in Java, but we have to establish the baseline machine performance first.) – rwong May 08 '11 at 12:32
  • @rwong, they are roughly binomially distributed, with P varying between 0.3..0.5 – finnw May 08 '11 at 12:42
  • You speak in percentage of CPU use (for all cores? one core?). This is good for knowing where most of the time is spent, but to measure progress isn't the total time more interesting? And the time spent in this specific part. For example, going from 33% to 27% doesn't seem impressive, but it is a gain of 28% in speed for this part and 10% for the total time. Not that bad, no? (Anyway, even if this part were instantaneous, you couldn't gain more than 33%.) But run your program many times, and see the % difference in time when you test your optimisations. – Nicolas Bousquet May 11 '11 at 08:04
  • I'll add too that you should consider performance only if you consider that your software is too slow. In that case you should have something like: this processing is too slow, users are complaining, or the computer can't keep up with the needed throughput. Then optimize for that objective. But here the objective is somewhat different: you want less % of time spent in this part of the program. In a sense, slowing down other parts would do the trick. What is the interest of focusing on % CPU usage? – Nicolas Bousquet May 11 '11 at 12:09
  • @Nicolas Bousquet, exact timings are dependent on (1) my hardware (2) the size of the data set. The result is used to build a web page so there is a fixed "budget" in terms of latency (which happens to be 100ms.) The relevant cost/benefit ratio is "Number of cores" / "Number of simultaneous queries possible at peak time whilst complying with the time limit." – finnw May 11 '11 at 12:21
  • The percentage is also dependent on the hardware. You might not have the same bandwidth, the same caching subsystem, the same processor, the same disk. So depending on the system, more time can be spent in one part or the other. From what you say, what you want to measure in fact is the number of requests per core you can serve concurrently. Testing different architectures can be quite interesting though, and helps see what changes the score most: the CPU core frequency, memory bandwidth, number of cores, size of the cache... With these results in mind, you can then optimize the relevant part. – Nicolas Bousquet May 11 '11 at 13:29

8 Answers

12

If you are on a recent x86 CPU there is an instruction for this, popcnt.

In recent versions of Java, Long.bitCount() uses this instruction. Just use -XX:+UsePopCountInstruction (this is the default in recent versions)

However, there are some bugs with it in JRE 6.0_u18 through 7.0_u5: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7063674
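A quick way to convince yourself that the intrinsic computes the same result as the portable bit-twiddling version is a brute-force comparison; a minimal sketch (the class name and the SWAR fallback are illustrative, not from the answer):

```java
// Compare Long.bitCount (intrinsified to POPCNT where available) against a
// portable SWAR popcount. Toggle -XX:+/-UsePopCountInstruction and re-run to
// time the two paths; names and iteration counts here are illustrative.
public final class PopCntCheck {
    // Classic SWAR popcount: pairwise sums, then nibble sums, then a
    // multiply to add all bytes into the top byte.
    public static int swar(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        x = (x + (x >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        return (int) ((x * 0x0101010101010101L) >>> 56);
    }

    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(42);
        for (int i = 0; i < 1000000; i++) {
            long v = r.nextLong();
            if (swar(v) != Long.bitCount(v)) {
                throw new AssertionError(Long.toHexString(v));
            }
        }
        System.out.println("ok");
    }
}
```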

jdabtieu
Scott Carey
4

This seems like one of those problems that is simply perfect for the GPU to work on. It should be able to slash your time by a couple orders of magnitude.

Otherwise I think you may have to deal with it at a higher level. Having multiple threads working on different segments of data at a time (which I'm sure you already do), processing the data while you are collecting it, sharing the work around multiple systems--something like that.
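The "different segments of data" idea can be sketched with a fixed thread pool; the class name, method, and chunking scheme below are illustrative, not from the answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of fanning the bit-count work out over a fixed thread pool:
// each task counts bits in its own segment, the partial sums are added.
public final class ParallelBitCount {
    public static long count(long[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads;
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(data.length, t * chunk);
                final int to = Math.min(data.length, from + chunk);
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) sum += Long.bitCount(data[i]);
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

As the comments below note, this only helps if the other cores are not already saturated.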

Bill K
  • Although this function could be split among threads, the other cores have plenty of work to do already so I don't think it could increase the overall throughput. – finnw Jan 30 '11 at 03:13
  • @finnw I agree with the threads--the GPU is by far your best means of accelerating this job. It's also the most work, of course. – Bill K Feb 04 '11 at 01:02
  • For the other threads, did you try? Or is this a supposition? – Nicolas Bousquet May 11 '11 at 07:53
  • And for the GPU? You don't want to try? There are APIs for that in Java. I know that a GPU is not so common on servers, but with some testing you could discover that one high-end GPU (+/- $500) could perform as well as a dozen CPU cores in this type of situation. Because your CPU seems to be already busy, you'll gain more parallelism. – Nicolas Bousquet May 11 '11 at 08:08
  • @finnw although I haven't done it, this seems appropriate: http://code.google.com/p/java-gpu/ – Bill K May 14 '11 at 00:40
  • @Nicolas Bousquet -- At 33% of the time spent, if the GPU could offload this, the application would at most go 50% faster. There is no way that you could be as fast as a dozen CPUs if it can only offload 33% of the work! This could only be true if the GPU could offload at least ~92% of the work and have nearly no overhead. – Scott Carey Jul 03 '12 at 02:33
4

If your machine has an integer ALU that can process data wider than 64 bits (also known as SIMD, such as SSE2 or VMX), you can compute the bit counts on several 64-bit elements at once.

Unfortunately, this will require you to provide machine-specific implementations in a lower-level language than Java.

rwong
2

I suspect that your app is memory-bound rather than CPU-bound, i.e. it spends more time fetching the values from memory than counting their bits. In that case you should try to reduce the size of the working set or improve access locality to reduce cache misses (if the algorithm allows it).

Michael Borgwardt
  • Probably (it's hard to tell from the profiler output) but I don't think there is much room for improvement as I search sequentially through the array of keys. – finnw May 10 '11 at 15:30
1

I'm no expert in the subject, but in case you haven't seen these pages, they may help:

http://www.reddit.com/r/programming/comments/84sht/fast_bit_couting_algorithms/

http://www-graphics.stanford.edu/~seander/bithacks.html

You may also want to poke around the many graphics libraries out there, especially those that are lower-level and/or speak directly to hardware.

EDIT: looks like you can use the relatively newly introduced POPCNT instruction (available on some recent AMD and Intel processors) for a potential speed increase, if you have the option to write low-level platform-specific code, and can target that specific architecture. http://kent-vandervelden.blogspot.com/2009/10/counting-bits-population-count-and.html and another article with benchmarks: http://www.strchr.com/crc32_popcnt

jkraybill
1

From my understanding:

I would use the 33% as an indicator only, as profiling small methods can really change the overall performance. So I would run the algorithm on some big dataset and look at the total time, and I would judge the effectiveness of my optimizations based on how that total time changes. I would also include a warm-up phase so that the JIT can do its optimisations.

In fact the bit counting seems to be one of the key parts of your algorithm anyway... if you optimize everything else, and manage to get 10 times faster for every other key part, you would still see something near 33% profiled for this part. That's not bad in itself.

Taking inspiration from this link http://bmagic.sourceforge.net/bmsse2opt.html you could try to use the SSE instructions present in all Intel/AMD processors now, if I remember right (you could always fall back to Java otherwise). An interesting point in that article is that most of the time the operation is memory bound anyway. But I would still try to see how this could work for you.

A GPU would be a perfect fit for insanely fast processing (easily a hundred times that of one CPU core) and bandwidth. The main problem would be pushing the data to the GPU's dedicated memory and getting the result back. But if you don't just perform bit counting and instead perform more operations, this could bring huge gains.

There is no shortcut anyway; you must try several approaches and see which brings the most gain. Don't measure the %, though, but the total time spent.
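The "total time with a warm-up phase" measurement described above might look roughly like this (dataset size and repetition counts are arbitrary, and the class name is illustrative):

```java
// Minimal timing harness: warm up so the JIT compiles the hot path,
// then measure total wall time over a large dataset.
public final class BitCountBench {
    public static long run(long[] data) {
        long sum = 0;
        for (long v : data) sum += Long.bitCount(v);
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        java.util.Random r = new java.util.Random(1);
        for (int i = 0; i < data.length; i++) data[i] = r.nextLong();

        for (int warm = 0; warm < 10; warm++) run(data); // warm-up for the JIT

        long t0 = System.nanoTime();
        long total = 0;
        for (int rep = 0; rep < 100; rep++) total += run(data);
        long t1 = System.nanoTime();
        // total is printed so the JIT cannot dead-code-eliminate the loop
        System.out.printf("total=%d, %.1f ms%n", total, (t1 - t0) / 1e6);
    }
}
```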

Nicolas Bousquet
1

I am now using this method, which interleaves four popcnt operations at a time. It is based on this C implementation.

private static final long M0=0x5555555555555555L,
                          M1=0x3333333333333333L,
                          M2=0x0f0f0f0f0f0f0f0fL;
public void store4Tags(long tag0, long tag1, long tag2, long tag3) {
    long count0 = tag0,
         count1 = tag1,
         count2 = tag2,
         count3 = tag3;
    count0 = (count0 & M0) + ((count0 >>> 1) & M0);
    count1 = (count1 & M0) + ((count1 >>> 1) & M0);
    count2 = (count2 & M0) + ((count2 >>> 1) & M0);
    count3 = (count3 & M0) + ((count3 >>> 1) & M0);

    count0 = (count0 & M1) + ((count0 >>> 2) & M1);
    count1 = (count1 & M1) + ((count1 >>> 2) & M1);
    count2 = (count2 & M1) + ((count2 >>> 2) & M1);
    count3 = (count3 & M1) + ((count3 >>> 2) & M1);

    count0 = (count0 + (count0 >>> 4)) & M2;
    count1 = (count1 + (count1 >>> 4)) & M2;
    count2 = (count2 + (count2 >>> 4)) & M2;
    count3 = (count3 + (count3 >>> 4)) & M2;

    count0 += count0 >>> 8;
    count1 += count1 >>> 8;
    count2 += count2 >>> 8;
    count3 += count3 >>> 8;

    count0 += count0 >>> 16;
    count1 += count1 >>> 16;
    count2 += count2 >>> 16;
    count3 += count3 >>> 16;

    count0 += count0 >>> 32;
    count1 += count1 >>> 32;
    count2 += count2 >>> 32;
    count3 += count3 >>> 32;

    storeWithPopCnt(tag0, 0x3f & (int) count0);
    storeWithPopCnt(tag1, 0x3f & (int) count1);
    storeWithPopCnt(tag2, 0x3f & (int) count2);
    storeWithPopCnt(tag3, 0x3f & (int) count3);
}

This outperforms the lookup table version slightly, and consumes no cache.
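For checking, the interleaved sequence above, restated for a single word, should agree with Long.bitCount on every input (the popcount64 name and wrapper class are mine, not part of the original method):

```java
// The per-word count sequence from store4Tags, restated for one word so it
// can be verified against Long.bitCount.
public final class Popcount64 {
    private static final long M0 = 0x5555555555555555L,
                              M1 = 0x3333333333333333L,
                              M2 = 0x0f0f0f0f0f0f0f0fL;

    public static int popcount64(long x) {
        long c = (x & M0) + ((x >>> 1) & M0); // sums of adjacent bit pairs
        c = (c & M1) + ((c >>> 2) & M1);      // 4-bit field sums
        c = (c + (c >>> 4)) & M2;             // byte sums
        c += c >>> 8;                         // fold bytes together
        c += c >>> 16;
        c += c >>> 32;
        return 0x3f & (int) c;                // total is at most 64
    }
}
```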

finnw
0

Rather than optimise this function, you are likely to be better off optimising the usage of this function. E.g. you could keep a counter.

public boolean get(int n) {
   // words is the backing long[] bit storage (illustrative, not in the original)
   return (words[n >>> 6] & (1L << n)) != 0;
}
public void set(int n) {
   if(!get(n)) bitCount++;
   words[n >>> 6] |= 1L << n;    // set the bit
}
public void clear(int n) {
   if(get(n)) bitCount--;
   words[n >>> 6] &= ~(1L << n); // clear the bit
}
public int bitCount() {
   return bitCount;
}

This avoids scanning the data by keeping track of the count of bits set. It moves the overhead to the moments when bits are set or cleared, and makes getting the number of set bits trivial. It appears that in your use case, the latter is much more common.

Peter Lawrey
  • Won't help in my application, as the most common use case is `bitCount(a ^ b)` to compute the Hamming distance between `a` and `b`. It is not practical to maintain counts for all pairs. – finnw May 08 '11 at 11:08
  • @finnw: based on your profiling, do you find that `(a == 0 || b == 0 || a == b)` occurs frequently in your application? If so, you might just take that as a shortcut. – rwong May 08 '11 at 12:23
  • @rwong, no, zeros are very rare. – finnw May 08 '11 at 12:43
  • If zeros are very rare you can check for ~0 first. (Or are they not that rare?) – Peter Lawrey May 08 '11 at 13:44
  • I mean whole words of zeros, not zero bits – finnw May 09 '11 at 07:39