2

I have the following code:

static char toUpper(char c)
{
    if (c >= 'a' && c <= 'z')
        return (char) (c & ~('a' - 'A'));

    return c;
}

Why does this correctly produce the uppercase letter? The bitwise operations are what I am mostly interested in. I understand the math more or less, and I can see that it works, but I do not understand the underlying reason why.

Daniel Centore
  • The difference between the ranges 'A' to 'Z' and 'a' to 'z' is one bit. Clear or set that bit to map one to the other. Note: this won't work for other letters. – Peter Lawrey Apr 24 '13 at 07:01

4 Answers

3

It works because, in ASCII (which is identical to the lower part of Unicode), the bit pattern for A is 0100 0001 (0x41) while a is 0110 0001 (0x61).

Hence 'a' - 'A' is 0x20 or 0010 0000, which is the bit you have to clear on a lower case letter to make it upper case.

You can see that from the following table: upper case letters range from 0x41 through 0x5a and the equivalent lower case letters from 0x61 through 0x7a:

[Image: ASCII table]

The way to clear a bit is to AND the value with the bitwise complement of a mask that has only that bit set.

Hence (using 8 bits), ~0x20 is 0xdf (1101 1111) and the operation for the letter c goes like this:

  0110 0011     0x63 or 'c'
& 1101 1111     0xdf, or ~('a' - 'A')
  ---- ----
  0100 0011     0x43 or 'C'
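
The same three rows can be reproduced in Java. This is only a small sketch: the class name ClearBitDemo and the helper bits8 are made up for illustration, and the helper shows just the low 8 bits, as the diagram does.

```java
class ClearBitDemo {
    // Hypothetical helper: the low 8 bits of a value as a zero-padded binary string.
    static String bits8(int value) {
        String s = Integer.toBinaryString(value & 0xFF);
        return String.format("%8s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        char c = 'c';
        int mask = ~('a' - 'A');            // ~0x20; the low 8 bits are 0xdf
        char result = (char) (c & mask);    // bit 0x20 cleared

        System.out.println(bits8(c) + "  " + c);           // 01100011  c
        System.out.println(bits8(mask) + "  mask");        // 11011111  mask
        System.out.println(bits8(result) + "  " + result); // 01000011  C
    }
}
```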

For a more detailed coverage of the bitwise operators, see here.


Of course, it bears mentioning that you could far more easily just use:

c = Character.toUpperCase (c);

without having to write your own function. See here for details. And, if you really wanted to limit it to the ASCII lowercase letters, you could still wrap it in the if statement, though I would suggest that's not such a good idea if you want your application used in the non-ASCII world.
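
To illustrate the difference, here is a short sketch that runs the method from the question next to Character.toUpperCase. The class name UpperCaseComparison and the choice of test characters are my own; the toUpper method is copied from the question.

```java
class UpperCaseComparison {
    // The ASCII-only method from the question, unchanged.
    static char toUpper(char c) {
        if (c >= 'a' && c <= 'z')
            return (char) (c & ~('a' - 'A'));
        return c;
    }

    public static void main(String[] args) {
        char ascii = 'q';
        char accented = '\u00e9'; // LATIN SMALL LETTER E WITH ACUTE

        // Both approaches agree on ASCII letters.
        System.out.println(toUpper(ascii) == Character.toUpperCase(ascii));       // true
        // Outside ASCII, toUpper leaves the character alone,
        // while Character.toUpperCase maps it to U+00C9.
        System.out.println(toUpper(accented) == Character.toUpperCase(accented)); // false
    }
}
```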

paxdiablo
2

Let us take an example to understand your code.

1) Suppose you pass the character 'p' to your method toUpper().

2) The condition in your if-statement is true, since 'p' is a lowercase letter between 'a' and 'z'. (For any character outside that range the method simply returns c unchanged.)

3) Inside if-statement you have the following code

return (char) (c & ~('a' - 'A'));

4) In above statement this part

('a' - 'A')

is evaluated first, since it is in parentheses. It subtracts 'A' from 'a', i.e. 97 - 65 in ASCII values ('A' is 65 and 'a' is 97). The result is always 32, regardless of the character you pass to your toUpper() method.

5) Next comes the operator ~, i.e.

~('a' - 'A')

As noted, ('a' - 'A') is always 32, so the operator ~ is applied to the number 32, i.e.

~32

For a two's-complement integer, the result of the operator ~ is given by the formula

~(number)
= -(number) - 1


Here the number is 32, so the formula gives

-(32) - 1
= -32 - 1
= -33

So the output of

~('a' - 'A')

is always -33
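
The value can be checked directly in Java. This is a minimal sketch; the class name ComplementDemo and the helper mask are invented for the example.

```java
class ComplementDemo {
    // Hypothetical helper returning the mask used in the question.
    static int mask() {
        return ~('a' - 'A');
    }

    public static void main(String[] args) {
        int n = 'a' - 'A';                                  // 32
        System.out.println(mask());                         // -33
        System.out.println(~n == -n - 1);                   // true
        // As a 32-bit pattern, every bit is 1 except the one worth 32 (0x20).
        System.out.println(Integer.toHexString(mask()));    // ffffffdf
    }
}
```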

6) Now you have

(c & ~('a' - 'A'))

i.e.

(c & -33)

Here c holds the letter the caller passes; in our example it is 'p'

i.e.

'p' & -33

Since the ASCII value of 'p' is 112, this is

112 & -33

i.e.

1110000 & 1011111

which are the low seven bits of 112 and -33 in binary (every higher bit of 112 is 0, so every higher bit of the result is 0 as well).

So after applying the operator & we get

1010000

7) Converting 1010000 to decimal gives 80, which is the ASCII value of the uppercase letter 'P'.

8) So in general the operation performed is

(ASCII value of the lowercase letter passed in) & -33

9) One more thing: Java uses Unicode, but the first 128 characters of Unicode are the ASCII characters, as @paxdiablo also said. That is why I have referred to ASCII values in this answer.
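
The walk-through for 'p' can be replayed step by step in code. The following is only a sketch; the class name WalkThroughP and the helper viaMask are made-up names. It also checks the remaining lowercase ASCII letters against Character.toUpperCase.

```java
class WalkThroughP {
    // Hypothetical helper: applies the mask without the range check.
    static char viaMask(char c) {
        return (char) (c & ~('a' - 'A'));
    }

    public static void main(String[] args) {
        char c = 'p';
        int difference = 'a' - 'A';   // 97 - 65 = 32
        int mask = ~difference;       // -33
        int masked = c & mask;        // 112 & -33 = 80
        System.out.println(masked + " -> " + (char) masked); // 80 -> P

        // The same mask gives the right answer for every lowercase ASCII letter.
        for (char ch = 'a'; ch <= 'z'; ch++) {
            if (viaMask(ch) != Character.toUpperCase(ch))
                throw new AssertionError("mismatch at " + ch);
        }
    }
}
```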

Rameshwar.S.Soni
1

In Java, chars are just unsigned 16-bit ints.

In ASCII, the lowercase letters are 97-122 and the corresponding uppercase letters are 65-90. So you just need to subtract 32 from each value.

If you look at the code (char) (c & ~('a' - 'A')), you can just start by evaluating it inside out.

'a' - 'A' gives you 32, the amount that has to be subtracted.

~(32) flips the bits, so you now have -33. This is a bit mask in which every bit is 1 except the sixth bit from the right (the bit worth 32).

(c & -33) applies the bitmask, setting that bit to 0. Because that bit is always 1 in a lowercase ASCII letter, this effectively subtracts 32.

(char) casts the result back to char. The cast is required: c & -33 has type int, and Java does not implicitly narrow a non-constant int to char in a return statement.
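
Clearing the bit and subtracting 32 agree only when the bit is set to begin with. A small sketch of the difference follows; the class and method names are invented for the example.

```java
class MaskVersusSubtract {
    // Clears the bit worth 32. The & produces an int, hence the cast back to char.
    static char clearBit5(char c) {
        return (char) (c & ~32);
    }

    // Plain arithmetic alternative, for comparison.
    static char minus32(char c) {
        return (char) (c - 32);
    }

    public static void main(String[] args) {
        // Lowercase input: the bit is set, so both give the same result.
        System.out.println(clearBit5('g') + " " + minus32('g')); // G G
        // Uppercase input: the bit is already clear, so the mask changes nothing,
        // while subtraction lands on character 39, the apostrophe.
        System.out.println(clearBit5('G') + " " + minus32('G')); // G '
    }
}
```

This is also why the range check in the question matters less for the mask than it would for a subtraction, although the check is still needed to leave digits and punctuation untouched.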

Antimony
0

Characters are represented as numbers. Have a look at an ASCII table to see which number each character is mapped to.

The bitwise operation above changes the number corresponding to the lowercase letter into the number of the corresponding uppercase letter.

gerrytan