This code is supposed to convert character strings to binary strings, but for a few strings it returns 16 binary digits instead of the 8 I expected.

public class aaa {        
    public static void main(String argv[]){
        String nux="ª";
        String nux2="Ø";
        String nux3="(";
        byte []bites = nux.getBytes();
        byte []bites2 = nux2.getBytes();
        byte []bites3 = nux3.getBytes();
        System.out.println(AsciiToBinary(nux));
        System.out.println(AsciiToBinary(nux2));
        System.out.println(AsciiToBinary(nux3));
        System.out.println("number of bytes :"+bites.length);
        System.out.println("number of bytes :"+bites2.length);
        System.out.println("number of bytes :"+bites3.length);
    }

    public static String AsciiToBinary(String asciiString){  

          byte[] bytes = asciiString.getBytes();  
          StringBuilder binary = new StringBuilder();  
          for (byte b : bytes)  
          {  
             int val = b;  
             for (int i = 0; i < 8; i++)  
             {  
                binary.append((val & 128) == 0 ? 0 : 1);  
                val <<= 1;  
             }  
             binary.append(' ');
          }  
          return binary.toString();  
    } 

}

I don't understand why the first two strings return 2 bytes each, since they are single-character strings.

Compiled and run here: https://ideone.com/AbxBZ9

This returns:

11000010 10101010 
11000011 10011000 
00101000 
number of bytes :2
number of bytes :2
number of bytes :1

I am using the code from this question: Convert A String (like testing123) To Binary In Java

NetBeans IDE 8.1

Cesar
  • What makes you think that the number of characters is the same as the number of bytes? There are tens of thousands of symbols out there. They can't all be represented with a single byte. It strongly depends on the encoding you use, but multi-byte encodings are rather common. – Ingo Bürk Jan 30 '16 at 22:45
  • Note that `getBytes` can take an argument for the character set you want to use. – Ingo Bürk Jan 30 '16 at 22:47
  • ASCII only has 128 symbols (7 bits); single-byte extensions such as ISO-8859-1 have 256 (one per possible byte value). Those 128 ASCII symbols are encoded the same way in UTF-8, ISO-8859-1, and other popular encodings, so as long as you do not use non-English symbols, you may think that everything is just ASCII. – tucuxi Jan 30 '16 at 22:50
  • There are more possible characters than possible byte values. So clearly not all characters can be encoded in a single byte. – David Schwartz Mar 02 '16 at 00:18

2 Answers

6

A character is not always 1 byte long. Think about it: many languages, such as Chinese or Japanese, have thousands of characters. How would you map all of those characters to single bytes?

You are using UTF-8 (one of the many, many ways of mapping characters to bytes). Looking up the sequence 11000010 10101010 in a UTF-8 character table, I arrive at

U+00AA  ª   11000010 10101010

which is the UTF-8 encoding for ª. UTF-8 is often the default character encoding (charset) for Java, but you cannot rely on this. That is why you should always specify a charset when converting strings to bytes or vice versa.
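As a minimal sketch of what specifying the charset looks like, the example below passes an explicit `Charset` to `getBytes`. The class name `CharsetDemo` and the helper `toBinary` are made up for illustration; `StandardCharsets` is part of the JDK since Java 7. The character is written as the escape `\u00AA` so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: encodes text with the given charset and
    // renders each resulting byte as 8 binary digits, separated by spaces.
    static String toBinary(String text, Charset charset) {
        StringBuilder binary = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (binary.length() > 0) {
                binary.append(' ');
            }
            // Mask with 0xFF so negative bytes are not sign-extended to 32 bits.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a fixed width of 8 digits.
            for (int i = bits.length(); i < 8; i++) {
                binary.append('0');
            }
            binary.append(bits);
        }
        return binary.toString();
    }

    public static void main(String[] args) {
        String s = "\u00AA"; // U+00AA, the first character from the question
        System.out.println(toBinary(s, StandardCharsets.UTF_8));       // 11000010 10101010
        System.out.println(toBinary(s, StandardCharsets.ISO_8859_1));  // 10101010
        System.out.println(toBinary("(", StandardCharsets.US_ASCII));  // 00101000
    }
}
```

The same string produces two bytes in UTF-8 and one byte in ISO-8859-1, so the byte count is a property of the chosen encoding, not of the character alone.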

tucuxi
-1

You can see why some characters take two bytes by running this simple code:

    // integer - binary 
    System.out.println(Byte.MIN_VALUE);             
    // -128 - 0b11111111111111111111111110000000

    System.out.println(Byte.MAX_VALUE);             
    // 127 - 0b1111111

    System.out.println((int) Character.MIN_VALUE);  
    // 0   - 0b0

    System.out.println((int) Character.MAX_VALUE);  
    // 65535 - 0b1111111111111111

As you can see, Byte.MAX_VALUE fits in just 7 bits, so it can be shown with 1 byte (01111111).

If you cast Character.MIN_VALUE to an integer, it is 0.
Its binary form needs only one bit, so it also fits in 1 byte (00000000).

But what about Character.MAX_VALUE?

In binary it is 1111111111111111, which is 65535 in decimal,
and it needs 2 bytes (11111111 11111111).

So characters whose numeric value is between 0 and 65535 can be shown with 1 or 2 bytes.

Hope this helps.
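To relate these value ranges to the byte counts in the question, here is a rough sketch of how UTF-8 in particular picks the number of bytes from the character's numeric value (its code point). The class name `Utf8Length` and the method `utf8Length` are hypothetical, and the method assumes a valid, non-surrogate code point. Note that the result can also be 3 or 4 bytes in UTF-8, and that other charsets follow different rules.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {

    // Number of bytes UTF-8 needs for one code point
    // (assumes a valid, non-surrogate code point).
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // up to 7 bits:  0xxxxxxx
        if (codePoint < 0x800) return 2;    // up to 11 bits: 110xxxxx 10xxxxxx
        if (codePoint < 0x10000) return 3;  // up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return 4;                           // up to 21 bits: 11110xxx plus three 10xxxxxx bytes
    }

    public static void main(String[] args) {
        // '(' , U+00AA, U+00D8 (the characters from the question) and one CJK character
        int[] codePoints = {0x28, 0xAA, 0xD8, 0x4E2D};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            // Compare the prediction with what the JDK's UTF-8 encoder produces.
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: predicted %d, actual %d%n", cp, utf8Length(cp), actual);
        }
    }
}
```

In this sketch `(` (0x28) is below 0x80 and takes one byte, while U+00AA and U+00D8 are above it and take two, which matches the output shown in the question.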

Rahmat Waisi
  • Your code only proves that `Character.MAX_VALUE` requires at least 2 bytes, but does not explain why some chars fit in a byte and others don't. The binary value of `Byte.MIN_VALUE` is also not that useful: that 32-bit pattern is -128 only when interpreted as 4-byte two's complement. In that sense, 0b10000000 is clearer and shorter (-128 in 1-byte two's complement). – tucuxi Feb 15 '16 at 10:45
  • So why don't you edit my post? I said what I know. Instead of downvoting my or other posts, try to edit them. We are here to share our knowledge; I did, now it is your turn :) – Rahmat Waisi Feb 15 '16 at 18:19
  • If you edit your post so that it answers the question, I will be happy to change my vote. I have not looked at your other posts. You are responsible for editing your own posts if they are wrong or don't answer the question: share knowledge, but take responsibility to make sure it is accurate. – tucuxi Feb 15 '16 at 18:35