15

I found an example in the book Data Communications and Networking by Behrouz A. Forouzan about upper- and lowercase letters that differ by only one bit in the 7-bit ASCII code.

For example, character A is 1000001 (0x41) and character a is 1100001 (0x61). The difference is in bit 6, which is 0 in uppercase letters and 1 in lowercase letters. If we know the code for one case, we can easily find the code for the other by adding or subtracting 32 in decimal, or we can just flip the sixth bit.

What does all this mean?

I find myself very confused by all of this. Could someone provide examples of how these things really work?

Oliv
  • 10,221
  • 3
  • 55
  • 76
Vibhakar SInha
  • 299
  • 1
  • 5
  • 12
  • 3
    If you don't understand the explanation in your question, you may need to read about binary: http://en.wikipedia.org/wiki/Binary_numeral_system . I'm not sure you can get a better example than 'A' and 'a'. – Pascal Cuoq Aug 25 '10 at 20:23
  • 4
    I would treat this as an interesting quirk from the past. I wouldn't dream of applying a bitmask to a char to do a case insensitive test (use ToUpper() or similar) And it gets worse now that we are in unicode era - is ß = ss :) – StuartLC Aug 25 '10 at 20:24
  • I don't know what you're asking for. In your second paragraph, you seem to have described how it works with an example. If you have any specific questions that you can't answer yourself, please ask. – David Thornley Aug 25 '10 at 20:25
  • Sorry David, actually I have found all these stuff from books and am not able to understand how these things are done.So, can you explain me the same with a simple understandable example.How these things are actually carried out. – Vibhakar SInha Aug 25 '10 at 20:35
  • 2
    @Vibhakar: Your second paragraph is a simple understandable example. If you don't understand it, you're missing something more fundamental. Are you familiar with binary numbers? Bit patterns? – David Thornley Aug 25 '10 at 21:02
  • 1
    You should not try to use this property in practice because this works only for ASCII. Probably the author of the book forgot to mention that this doesn't work for Unicode or that the lowercasing/uppercasing is dependent on the 'current' language. – sorin Sep 13 '10 at 19:06
  • 1
    Vibhakar Sinha, how about selecting an answer? – ErikE Dec 22 '10 at 01:51
  • [What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?](https://stackoverflow.com/a/54585515) – Peter Cordes Nov 30 '20 at 10:20

7 Answers

37

Let's use a case that you'll find more familiar: base 10.

  1. Suppose we have a base 10 computer, where each 10bit stores a value from 0 to 9, and a 10byte is 5 10bits long, so that each byte can store 100,000 values (0 through 99,999).

  2. You wish to assign letters to particular positions in a 10byte so that this computer can communicate text data with other computers. One way you could do this would be like so:

     00101 A    00201 a
     00102 B    00202 b
     00103 C    00203 c
     00104 D    00204 d
     00105 E    00205 e
     00106 F    00206 f
     00107 G    00207 g
     00108 H    00208 h
     00109 I    00209 i
     00110 J    00210 j
     00111 K    00211 k
     00112 L    00212 l
     00113 M    00213 m
     00114 N    00214 n
     00115 O    00215 o
     00116 P    00216 p
     00117 Q    00217 q
     00118 R    00218 r
     00119 S    00219 s
     00120 T    00220 t
     00121 U    00221 u
     00122 V    00222 v
     00123 W    00223 w
     00124 X    00224 x
     00125 Y    00225 y
     00126 Z    00226 z
    
  3. Do you see that each lower case letter differs from the upper case letter by only a single 10bit digit, in the 3rd column from the right? It didn't have to be designed this way. It was simply convenient, because then any time we want to adjust the case of a letter we can simply modify one of the digits (10bits) without caring what the rest of the number is or bothering with twenty-six different transformations when we can do one. We couldn't have chosen the second digit because instead of being 100 apart, they'd be only 10 apart and would overlap.

  4. Now, base 2 works exactly the same way, but instead of each digit representing 0-9, each 2bit can only represent 0-1. Using eight 2bits gives us only 256 possible combinations (0 through 255). The ASCII codes for the upper and lower case letters in binary look like this:

     01000001 A        01100001 a
     01000010 B        01100010 b
     01000011 C        01100011 c
     01000100 D        01100100 d
     01000101 E        01100101 e
     01000110 F        01100110 f
     01000111 G        01100111 g
     01001000 H        01101000 h
     01001001 I        01101001 i
     01001010 J        01101010 j
     01001011 K        01101011 k
     01001100 L        01101100 l
     01001101 M        01101101 m
     01001110 N        01101110 n
     01001111 O        01101111 o
     01010000 P        01110000 p
     01010001 Q        01110001 q
     01010010 R        01110010 r
     01010011 S        01110011 s
     01010100 T        01110100 t
     01010101 U        01110101 u
     01010110 V        01110110 v
     01010111 W        01110111 w
     01011000 X        01111000 x
     01011001 Y        01111001 y
     01011010 Z        01111010 z
    

    Just the same as before, they differ by only one 2bit digit, here in the 6th column from the right. We couldn't have used a digit any farther to the right (smaller) because then the lists would have overlapped (2^5 = 32 and accordingly we used all bits 0 through 5, but 2^4 = 16, which could not cover the 26 letters of the alphabet).

  5. Just to fill things out a little, here's an example of what those binary values mean. Let's take the one for G. To understand what 01000111 means in binary:

      Pos:   7  6  5  4  3  2  1  0
      Bit:   0  1  0  0  0  1  1  1
      Val: 128 64 32 16  8  4  2  1
     Mult:   0 64  0  0  0  4  2  1
      Add: 64 + 4 + 2 + 1 = 71, which is the ASCII code for G.
    

    Doing the same thing for the letter G in the special base 10 system I constructed above:

       Pos:     4    3    2    1    0
     10Bit:     0    0    1    0    7
       Val: 10000 1000  100   10    1
      Mult:     0    0  100    0    7
       Add: 100 + 7 = 107, which is my special 10ASCII code for G.
    

    Look back at the "Val" row for binary. Do you see that starting from the right, each value is double the previous one? Doubling each time we get 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and so on. This is how a binary digit's position determines its value, just like a decimal digit's position determines its value with powers of 10: 1, 10, 100, 1000, 10000, 100000, and so on.

    I realize this seems silly because all I did was convert 107 to 107... but 107 isn't just a number, it is a shorthand form for:

     1 hundreds + 0 tens + 7 ones.
    

    Another way we could represent that is

     0 x 10^4 + 0 x 10^3 + 1 x 10^2 + 0 x 10^1 + 7 x 10^0.
    

    Similarly, 01000111 isn't just a binary number, it is a shorthand form for

     0 x 2^7 + 1 x 2^6 + 0 x 2^5 + 0 x 2^4 + 0 x 2^3 + 1 x 2^2 + 1 x 2^1 + 1 x 2^0
    

    Which is what I already showed you:

     0 + 64 + 0 + 0 + 0 + 4 + 2 + 1
     = 64 + 4 + 2 + 1
     = 71
    

Also, you may have been wondering what 0x41 and 0x61 meant. The 0x part indicates that the digits to follow are to be understood as hexadecimal, which is base 16. There are only 10 digits in our number system, so we need 6 more digits somehow. Thus, hexadecimal uses the digits 0-9 and treats the letters A-F as the remaining digits, where A is 10 up through F as 15. Hexadecimal is very convenient for computers because 16 is a power of 2, and an 8-bit byte thus takes exactly two hex digits to encode (and each hex digit encodes exactly four binary digits). Taking 0x41, expanding 4 to its binary representation 0100 and expanding 1 to its binary representation 0001 you get 01000001, which you can see is the code for A as shown. To convert it to decimal it is 4 x 16 + 1 x 1 = 65. We multiply the 4 by 16 because each successive hexadecimal digit leftward is 16 times the previous digit, following the same pattern as I showed you above for base 2 and 10.

I hope this is sufficient for you to understand a little bit more about binary and ASCII codes.

Note 1: The reason for 8 bits in a byte instead of 2 as you might think is that 8 is a better balance, as a 2-bit "byte" would only encode 4 values, and transmitting the upper and lower case letters of the alphabet alone would require 3 bytes! There is nothing inherent in binary that forces the choice of 8 bits per byte, except that 8 is also a power of 2 which makes a lot of the math involved in working with binary information simpler and things align on edges better.

In the early days of computing, different systems used many different byte lengths, including 7, 9, or other numbers! Nowadays the computing world has settled on 8 as the standard, useful number of bits in a byte (though note that text sometimes requires 2 to 4 bytes per character to fully represent all the possible characters).

I am sure that choosing something like 6 bits per byte would have worked out awkwardly, and would not have made good use of the full range of values available.

Note 2: My system of five bits in a 10byte is based on the impracticality of using ten 10bits per byte, which yields a really huge number that would waste a lot of storage space. I chose five because ten is evenly divisible by it, which would undoubtedly be useful. (Originally, my answer used ten 10bits per 10byte, but it was too darned big!)

ErikE
  • 48,881
  • 23
  • 151
  • 196
  • 7
    01010111 01010100 01000110 00111111 – Lukman Aug 25 '10 at 22:15
  • ASCII's choice to align the upper and lower-case alphabet ranges to the same position relative to a `%32` boundary allows some nice bit-manipulation tricks for stuff like detecting if a byte is an alphabetic character: [What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?](https://stackoverflow.com/a/54585515) – Peter Cordes Nov 30 '20 at 10:22
4

This relationship between the upper case and lower case letters was deliberate. When the ASCII code was formulated, computer hardware was primitive and software needed to conserve every byte. Flipping a single bit takes very little hardware or code to accomplish.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
2

In order to add or subtract 32, you must first know whether the character is uppercase or lowercase to begin with.

When this book was written, the programming languages most people were using did not have strings or .equalsIgnoreCase. This was pre-i18n, and when a business had a server, you would telnet to it (like xterm) and get a command-line menu. What he's describing was typically used to create a nice case-insensitive menu for your users, taking advantage of the numeric layout of the ASCII table.

It can be very fast, because there are bit-wise assembler instructions to do the math in either direction, regardless of whether the characters are already upper or lowercase.

c = c | 32 // to lowercase (sets the bit worth 32)

c = c & (1+2+4+8+16+ 0 +64+128) // to uppercase (the mask is 223, which clears the bit worth 32)

Say you had a Java-like language, without objects or the standard libs. Your networking author is prompting you to code like this:

    public static void main()
    {
        println("What would you like to do?");
        println("Inventory (inv)");
        println("Reports (rep)");

        char[] ca = readUserInput();        
        for (int i = 0; i < ca.length; i++)
            ca[i] = ca[i] | 32;  // convert to lowercase, by ensuring the bit worth 32 is set

        if (compareInput(ca, "inv") == true)
            doInventory();
    }

Have you tried searching Google, and sometimes capitalized a person's name?

Brian Maltzan
  • 1,327
  • 10
  • 12
1

I think most of these answers are unnecessarily complicated and occasionally condescending.

The decimal-to-ASCII character mapping is arbitrary and doesn't really have anything to do with understanding how base 2 or base 10 works. It's purely a convenience thing. If someone mistakenly coded a lowercase character but meant an uppercase one, it's more convenient to just flip one bit instead of having to recode an entire byte, and it's less prone to human error. If the output is 'a' but we wanted 'A', at least we know we got most of the bits right, and we just have to flip 2^5 to add or subtract 32. It's that easy. Why pick specifically bit 5 (it's not 6 as some have said; you start counting from 0)? Clearly that's the one that makes sense to satisfy two ranges of 26 characters with only one bit flip. If you did this on a lesser-valued bit, you'd have to flip more than one.

shake
  • 39
  • 1
1

Take a look: the 6th bit = 32, so if you flip it you subtract or add 32.

Bit value
1   1
2   2
3   4
4   8
5   16
6   32 (32 = hex 20)

Now if you look here http://asciitable.com/, you can see the ascii table for all the characters and will notice that A = 65 and a = 97

SQLMenace
  • 132,095
  • 25
  • 206
  • 225
  • Thanks for your answer but here you have missed something "a" is 97 not 96. 96 is for Grave Accent i.e. ` Sir, can you explain me all these stuff using a simple example and how actually these things are worked out. – Vibhakar SInha Aug 25 '10 at 20:32
  • I think this is a simple as it gets, the 6th bit means 32 so if it is on or off the difference is 32,which is the same difference between upper case and lower case characters – SQLMenace Aug 25 '10 at 20:36
  • Thanks Sir, it means when we have to convert lowercase to uppercase then we have to subtract the number by 32 and for uppercase to lowercase we have to add 32.Am i right, Sir. – Vibhakar SInha Aug 25 '10 at 20:39
  • 2
    Even better, to convert to lowercase, you do a bitwise OR with the value 32. To convert to upper case, do a bitwise AND with the value 223. Bitwise and's and or's are more efficient for the CPU then addition and subtraction. Also, the bit logic prevents you from needing to check the value (eliminating an IF check). – George Mastros Aug 25 '10 at 21:58
  • @ErikE True, but `ToOppositeCase()` seems less useful than `ToUpper()` and `ToLower()`. Generally the point is to normalize case in one direction or the other. – Dan Bechard Dec 28 '15 at 14:43
  • There is no claim in my comments about the usefulness of `ToggleCase` compared to`ToUpper` and `ToLower`. Are you saying that we shouldn't discuss this topic at all or that it never has any use? Your comment is like someone saying about some method `ThrowUp()` under discussion that it seems less useful than `TakeBite()` or `EatWholeMeal()`. We *know* that it isn't the same thing and used less commonly, but we sometimes might use it, anyway! Consider: once you've determined that a character is the wrong case, why not just flip it instead of worrying about which way it must go? – ErikE Dec 28 '15 at 16:20
1

http://asciitable.com/

0x61 is hexadecimal for 97 = a
0x41 is hexadecimal for 65 = A

So subtracting/adding decimal 32 is indeed the way to convert to uppercase/lowercase.

Z is 90  = 0b1011010 = 0x5A
z is 122 = 0b1111010 = 0x7A

Which is a difference of 0b0100000 in binary, or 0x20 or 32 in decimal.

Thus switching the 6th bit changes case.

Gazler
  • 83,029
  • 18
  • 279
  • 245
0
template<char TLBound, char TUBound>
struct CharRange
{
    enum 
    {
        LBound = TLBound,
        UBound = TUBound
    };

    static bool InRange(char ch)
    {
        return (ch >= LBound)  && (ch <= UBound);
    };
};

typedef CharRange<'a', 'z'> lcaseLetters;
typedef CharRange<'A', 'Z'> ucaseLetters;

char toUpper(char ch)
{
    if(lcaseLetters::InRange(ch))
    {
        return (ch ^ (0x1 << 5));
    }

    return ch;
}

char toLower(char ch)
{
    if(ucaseLetters::InRange(ch))
    {
        return (ch ^ (0x1 << 5));
    }

    return ch;
}
Nitheesh George
  • 1,357
  • 10
  • 13