what (r+1 + (r >> 8)) >> 8 does?

Question

In some old C/C++ graphics related code, that I have to port to Java and JavaScript I found this:

b = (b+1 + (b >> 8)) >> 8; // very fast

Where b is short int for blue, and same code is seen for r and b (red & blue). The comment is not helpful.

I cannot figure out what it does, apart from obvious shifting and adding. I can port without understanding, I just ask out of curiosity.

You can port without understanding?! That is a skill to treasure. — seanhodges, May 14 '15 at 12:40
Without knowing where those color values come from or why they're being manipulated, it's probably impossible to get an answer unless somebody just happens to recognize some sort of trick. — Pointy, May 14 '15 at 12:44
@seanhodges, of course you can port code from one language to another without understanding the algorithm involved. Why not? — , May 14 '15 at 13:03
Are r,g and b values given by 6 differt channels? I mean a single 8bit channel for low part of red and a 8bit channel that give hi part of red and so on? What is the source of r,g and b value? Image file, streaming data.... — LPs, May 14 '15 at 13:27

score 10 · Accepted Answer · edited May 23 '17 at 11:51

y = ( x + 1 + (x>>8) ) >> 8 // very fast

This is a fixed-point approximation of division by 255. Conceptually, this is useful for normalizing calculations based on pixel values such that 255 (typically the maximum pixel value) maps to exactly 1.

It is described as very fast because fully general integer division is a relatively slow operation on many CPUs -- although it is possible that your compiler would make a similar optimization for you if it can deduce the input constraints.

This works based on the idea that 257/(256*256) is a very close approximation of 1/255, and that x*257/256 can be formulated as x+(x>>8). The +1 is rounding support which allows the formula to exactly match the integer division x/255 for all values of x in [0..65534].

Some algebra on the inner portion may make things a bit more clear...

       x*257/256
     = (x*256+x)/256
     = x + x/256
     = x + (x>>8)

There is more discussion here: How to do alpha blend fast? and here: Division via Multiplication

By the way, if you want round-to-nearest, and your CPU can do fast multiplies, the following is accurate for all uint16_t dividend values -- actually [0..(2^16)+126].

y = ((x+128)*257)>>16 // divide by 255 with round-to-nearest for x in [0..65662]

P.S. Doing image manipulation on pixels that have been encoded in 8-bits is typically incorrect (but often close enough) due to the non-linear mapping introduced by [*gamma compression*](http://en.wikipedia.org/wiki/Gamma_correction). — Brent Bradburn, May 14 '15 at 14:58
Notes relevant to "fast multiplies": http://stackoverflow.com/q/6357038 — Brent Bradburn, Sep 22 '16 at 03:12

score 3 · Answer 2 · answered May 14 '15 at 12:47

3

Looks like it is meant to check if blue (or red or green) is fully used. It evaluates to 1, when b is 255, and is 0 for all lower values.

answered May 14 '15 at 12:47

kaykay

556
2
7

1

If `b` is between 1 and 255 this makes sense. But then `(b >> 8)` would always be zero... writing `(b+1) >> 8` would be enough. – ArnonZ May 14 '15 at 12:52
The value is a short, though. The maximum value is 65535 (for unsigned). – seanmk May 14 '15 at 12:56
That's the maximum physical value. Maybe there's a maximum logical value that the code enforces. – ArnonZ May 14 '15 at 12:59

user2465201 · Answer 3 · 2016-09-22T11:38:54.510

A common use case of when you'd want to use a formula that's more accurate than 257/256 is when you have to combine a lot of alpha values together for each pixel. As one example, when doing image shrinking, you need to combine 4 alphas for each source pixel contributing to the destination, and then combine all the source pixels contributing to the destination.

I posted an infinitely accurate bit twiddling version of /255 but it was rejected without reason. So I'll add that I implement alpha blending hardware for a living, I write real time graphics code and game engines for a living, and I've published articles on this topic in conferences like MICRO, so I really know what I'm talking about. And it might be useful or at least entertaining for people to understand the more accurate formula that is EXACTLY 1/255:

Version 1: x = (x + (x >> 8)) >> 8 - no constant added, won't satisfy (x * 255) / 255 = x, but will look fine in most cases. Version 2: x = (x + (x >> 8) + 1) >> 8 - WILL satisfy (x * 255) / 255 = x for integers, but won't hit correct integer values for all alphas

Version 3: (simple integer rounding): (x + (x >> 8) + 128) >> 8 - Won't hit correct integer values for all alphas, but will on average be closer than Version 2 at the same cost.

Version 4: Infinitely accurate version, to any level of precision desired, for any number of composite alphas: (useful for image resizing, rotation, etc.):

[(x + (x >> 8)) >> 8] + [ ( (x & 255) + (x >> 8) ) >> 8]

Why is version 4 infinitely accurate? Because 1/255 = 1/256 + 1/65536 + 1/256^3 + 1/256^4 + ...

The simplest expression above (version 1) doesn't handle rounding, but it also doesn't handle the carries that occur from this infinite number of identical sum columns. The new term added above determines the carry out (0 or 1) from this infinite number of base 256 digits. By adding it, you are getting the same result as if you added all the infinite addends. At which point you can round by adding a half bit to whatever accuracy point you want.

Not needed for the OP perhaps, but people should know that you don't need to approximate at all. The formula above is actually more accurate than double precision floating point.

As for speed: In hardware, this method is faster than even a single (full width) add. In software, you have to consider throughput vs latency. In latency, it may still be faster than a narrow multiply (definitely faster than a full width multiply), but in the OP context, you can unroll many pixels at once, and since modern multiply units are pipelined, you are still OK. In translation to Java, you probably have no narrow multiplies, so this could still be faster, but need to check.

WRT the one person who said "why not use the built in OS capabilities for alpha blitting?": If you already have a substantial graphical code base in that OS, this might be a fine option. If not, you're looking at hundreds to thousands as many lines of code to leverage the OS version - code that's far harder to write and debug than this code. And in the end, the OS code you have isn't portable at all, while this code can be used anywhere.

One of the reasons given for rejecting your edits to my post was "This edit deviates from the original intent of the post. Even edits that must make drastic changes should strive to preserve the goals of the post's owner." In other words, the appropriate thing to do is write your own post, as you have now done. — Brent Bradburn, Sep 22 '16 at 03:02

LPs · Answer 4 · 2015-05-14T12:56:44.877

2

Is value of b+1 + b/256, this calculation divided by 256.

In that way, using bit shift the compiler tranlte using CPU level shift instruction, instead of using FPU or library division functions.

edited May 14 '15 at 12:56

answered May 14 '15 at 12:40

LPs

16,045
8
30
61

1

This is true, but it doesn't really address the question of why the code is carrying out that operation. – Pointy May 14 '15 at 12:42
1

I think the OP is fully aware that `>> 8` is `/ 256`. What they're asking is *why* the expression is being done, that is, what does it mean, semantically. – lurker May 14 '15 at 12:44

Brett Walker · Answer 5 · 2015-05-14T12:56:28.977

2

I suspect that it is trying to do the following:

boolean isBFullyOn = false;

if (b == 0xff) {
  isBFullyOn = true;
}

Back in the days of slow processors; smart bit-shifting tricks like the above could be faster than the obvious if-then-else logic. It avoids a jump statement which was costly.

It probably also sets an overflow flag in the processor which was used for some latter logic. This is all highly dependant upon the target processor.

And also on my part speculative!!

edited May 14 '15 at 12:56

answered May 14 '15 at 12:42

Brett Walker

3,566
1
18
36

but `bool isFullyOn = b == 0xFF` would be faster still (assuming no compiler optimization of course.) – Dale Wilson May 14 '15 at 13:21
In Java this would be true; but the OP is porting some old C/C++ graphic code that was most likely compiled to target a specific CPU and this would have lead to highly performant assembly code. – Brett Walker May 14 '15 at 13:23
@Dave what you believe is fast is actually syntactic sugar for the code I posted above. It implies a jump in the assembly code. – Brett Walker May 14 '15 at 13:30
Yep. I realized that shortly after I posted it. I guess the real point of the question is that programmers who hack the compiler into producing the desired code should be required to document: what they are trying to do, why they think it works, and what version of the compiler they tested it on. – Dale Wilson May 14 '15 at 13:34
See the link @nobar posted. These types of optimisations are quite common. – Brett Walker May 14 '15 at 13:37

score 1 · Answer 6 · answered May 14 '15 at 12:56

1

b = (b + (b >> 8)) >> 8; is basically b = b *257/256 .

I would consider +1 being an ugly hack of the -0.5 mean reduce caused by the inner >>8.

I would write it as b = (b + 128 + ((b +128)>> 8)) >> 8; instead.

answered May 14 '15 at 12:56

user3528438

2,737
2
23
42

Your equivalence is off (should be `b*257/256/256`), but seeing your answer triggered me to recognize this formula. Thanks. – Brent Bradburn May 15 '15 at 13:38

score 1 · Answer 7 · edited May 23 '17 at 12:19

Running this test code:

public void test() {
    Set<Integer> results = new HashSet<Integer>();
    // short int ranges between -32767 and 32767
    for (int i = -32767; i <= 32767; i++) {
        int b = (i + 1 + (i >> 8)) >> 8;
        if (!results.contains(b)) {
            System.out.println(i + " -> " + b);
            results.add(b);
        }
    }
}

Produces all possible values between -129 and 128. However, if you are working with 8-bit colours (0 - 255) then the only possible outputs are 0 (for 0 - 254) and 1 (for 255) so it is likely that it is attempting the function @kaykay posted.

what (r+1 + (r >> 8)) >> 8 does?

7 Answers7