87

I'm writing a web application in Google App Engine. It lets people edit HTML code that gets stored as an .html file in the blobstore.

I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print it to an HTML page so the user can edit the HTML code. Everything works great!

Here's my only problem now:

The byte array is having some issues when converting back to a string. Smart quotes and a couple of other characters come out looking funky (?'s, Japanese symbols, etc.). Specifically, several bytes with negative values are causing the problem.

The smart quotes are coming back as -108 and -109 in the byte array. Why is this and how can I decode the negative bytes to show the correct character encoding?

Toon Krijthe
Josh
  • Duplicate of http://stackoverflow.com/questions/1536054/how-to-convert-byte-array-to-string-and-vice-versa – james.garriss Sep 05 '13 at 18:53
  • Hi, I know this is a really old post, but I am facing a similar problem. I am making a man-in-the-middle proxy for SSL. I listen to the socket and read the data from an `InputStream` into a `byte[]`. When I try to convert the `byte[]` into a String (I need to use the response body for attacks), I get really funny characters full of smart quotes, question marks and whatnot. I believe your problem is the same as mine, as we are both dealing with `html` in a `byte[]`. Can you please advise? – Parul S May 29 '14 at 13:06
  • By the way, I went as far as looking up my system's encoding in the System properties and found it to be "Cp1252". I then used `String str=new String(buffer, "Cp1252");` but it did not help. – Parul S May 29 '14 at 13:12
  • possible duplicate of [What is character encoding and why should I bother with it](http://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it) – Raedwald Apr 10 '15 at 12:19

7 Answers

141

The byte array contains characters in a specific encoding (which you need to know). The way to convert it to a String is:

String decoded = new String(bytes, "UTF-8");  // example for one encoding type

By the way, the raw bytes may appear as negative decimals simply because the Java datatype byte is signed; it covers the range from -128 to 127.
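As a minimal sketch of the point about signed bytes: masking with `0xFF` recovers the unsigned value (0 to 255) that encoding tables use. The class and method names here are hypothetical and only for illustration.

```java
public class SignedByteDemo {

    // Convert a signed Java byte (-128..127) to its unsigned value (0..255).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] raw = {-109, -108}; // the values reported in the question
        for (byte b : raw) {
            // Prints "-109 -> 0x93" and "-108 -> 0x94"
            System.out.printf("%d -> 0x%02X%n", b, toUnsigned(b));
        }
    }
}
```

The negative values are therefore not corrupt data; they are simply the bytes 0x93 and 0x94 viewed through a signed type.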


-109 = 0x93: Control Code "Set Transmit State"

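The code point U+0093 is a non-printable control character in Unicode, and a lone 0x93 byte is not valid UTF-8 either (it is only legal as a continuation byte inside a multi-byte sequence). So UTF-8 is most likely not the correct encoding for that character stream.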

0x93 in "Windows-1252" is the opening "smart quote" that you're looking for (0x94, i.e. -108, is the closing one). The Java name of that encoding is "Cp1252". The next line provides some test code:

System.out.println(new String(new byte[]{-109}, "Cp1252")); 
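To see the difference between the two encodings, the following sketch decodes both bytes from the question with Windows-1252 and with UTF-8 and prints the resulting code points. It assumes the runtime ships the windows-1252 charset (alias "Cp1252"), which is not one of the charsets every JVM is required to support; the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SmartQuoteDemo {

    // Decode the bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render each char as U+XXXX so the result does not depend on the console font.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] quotes = {-109, -108}; // 0x93 and 0x94

        // Assumption: windows-1252 is available in this runtime.
        Charset cp1252 = Charset.forName("windows-1252");

        // Windows-1252 maps the bytes to the curly quotes U+201C and U+201D.
        System.out.println(toCodePoints(decode(quotes, cp1252)));

        // As UTF-8 the bytes are malformed, so replacement characters appear instead.
        System.out.println(toCodePoints(decode(quotes, StandardCharsets.UTF_8)));
    }
}
```

Printing code points rather than the characters themselves keeps the check independent of whatever encoding the terminal uses.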
nalply
Andreas Dolk
  • 6
    I tried using UTF-8 and it still came out as ?'s. How come it isn't finding a mapping for those negative values? – Josh Apr 15 '11 at 06:39
  • 0x93 is a valid continuation byte in UTF-8, though - the presence of that byte only rules out its being UTF-8 if it doesn't come after a byte with the first two bits set. – Nick Johnson Apr 18 '11 at 05:10
  • 1
    @Josh Andreas explains why - because Java's `byte` datatype is signed. The 'negative' values are just bytes with the most significant bit set. He also explains what the most likely character set you should be using is - Windows-1252. You should know what character set to use from context or convention, though, without having to guess. – Nick Johnson Apr 18 '11 at 05:11
25

Java 7 and above

You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.

For example, for UTF-8 encoding:

String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
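A rough sketch of why the constant can be considered safer: the String-named constructor throws a checked `UnsupportedEncodingException` when the name is mistyped, whereas the `Charset` overload cannot fail that way. The class, the helper names, and the deliberately wrong name "UTF-88" are made up for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetConstantDemo {

    // Decoding with a constant: no checked exception, no chance of a typo in the name.
    static String viaConstant(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding by name: reports whether the runtime recognised the charset name.
    static boolean nameIsSupported(byte[] bytes, String charsetName) {
        try {
            new String(bytes, charsetName);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        System.out.println(viaConstant(bytes).equals("h\u00e9llo")); // true
        System.out.println(nameIsSupported(bytes, "UTF-8"));         // true
        System.out.println(nameIsSupported(bytes, "UTF-88"));        // false: mistyped name
    }
}
```

`StandardCharsets` only covers the six charsets every JVM must provide, so for something like Windows-1252 you would still need `Charset.forName(...)`.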
davnicwil
  • 1
    This is a repeat of an answer from 2011. -1 – james.garriss Sep 21 '15 at 19:18
  • 2
    @james.garriss I don't think it is, insofar as I'm just mentioning a new constructor introduced in Java 7 that allows the encoding to be passed as a constant, which in my opinion is nicer and safer than the previous API mentioned in the earlier answers, where the encoding was passed as a String, if at all. – davnicwil Sep 22 '15 at 03:44
11

You can try this. Note that this constructor decodes the bytes using the platform's default charset:

String s = new String(bytearray);
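As a small sketch of what the one-argument constructor does: it behaves like passing `Charset.defaultCharset()` explicitly, so the result depends on the machine the code runs on. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {

    // Spell out the charset that new String(byte[]) picks implicitly.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] bytes = {-109, -108};

        // The default differs between machines and JVM versions.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Both calls decode with the same charset, so the results match.
        System.out.println(new String(bytes).equals(decodeWithDefault(bytes)));
    }
}
```

If the data was written on a different system than the one reading it, naming the charset explicitly is usually the more predictable choice.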
Flexo
Muhammad Aamir Ali
5
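import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public static String readFile(String fn) throws IOException
{
    File f = new File(fn);
    byte[] buffer = new byte[(int) f.length()];

    // readFully keeps reading until the buffer is full; a single read() may return
    // early. try-with-resources closes the stream even if reading fails.
    try (DataInputStream is = new DataInputStream(new FileInputStream(f))) {
        is.readFully(buffer);
    }

    return new String(buffer, "UTF-8"); // use desired encoding
}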
craig
5
public class Main {

    /**
     * Example method for converting a byte to a String.
     */
    public void convertByteToString() {

        byte b = 65;

        //Using the static toString method of the Byte class
        System.out.println(Byte.toString(b));

        //Using simple concatenation with an empty String
        System.out.println(b + "");

        //Creating a byte array and passing it to the String constructor
        System.out.println(new String(new byte[] {b}));

    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        new Main().convertByteToString();
    }
}

Output

65
65
A
Adi Sembiring
4

I suggest Arrays.toString(byte_array);

It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see while debugging, something like [1, 2, 3]. If you want to save the values themselves without converting the bytes to characters, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In that case, s contains the characters that the bytes 1, 2 and 3 decode to.
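The difference between the two calls can be sketched as follows. The class and method names are hypothetical, and US-ASCII is chosen only so the example does not depend on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteArrayToTextDemo {

    // Textual dump of the numeric values, as a debugger would show them.
    static String asNumbers(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    // Decode the bytes into the characters they represent.
    static String asCharacters(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = {72, 105};

        System.out.println(asNumbers(bytes));    // prints [72, 105]
        System.out.println(asCharacters(bytes)); // prints Hi
    }
}
```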

Questioner
  • Can you give more information on why you're suggesting this? (Will it solve the problem? Can you say why it solves it?) Thanks! – Dean J Jun 21 '15 at 20:24
  • It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see while debugging, something like [1, 2, 3]. If you want to save the values themselves without converting the bytes to characters, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In that case, s contains the characters that the bytes 1, 2 and 3 decode to. – Questioner Jun 21 '15 at 20:44
  • @sas, you should add this information to your answer itself (by editing it) rather than as a comment. Generally on SO you should always keep in mind that comments may at any point be deleted - the _really_ important information should be in the answer itself. – Jeen Broekstra Jun 21 '15 at 21:05
3

The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.

To work out whether it is Java or your display that is a problem, do this:

    for(int i=0;i<str.length();i++) {
        char ch = str.charAt(i);
        System.out.println(i+" : "+ch+" "+Integer.toHexString(ch)+((ch=='\ufffd') ? " Unknown character" : ""));
    }

Java will have mapped any bytes it could not decode to 0xfffd (U+FFFD), the official Unicode replacement character for unknown characters. If you see a '?' in the output but the character is not 0xfffd, the problem is your display font or encoding, not Java.
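If you would rather have Java report undecodable bytes than silently substitute 0xfffd, one option is a `CharsetDecoder` configured with `CodingErrorAction.REPORT`. This is a sketch of that alternative, not something the answer above prescribes; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {

    // Returns true only if every byte decodes cleanly in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of inserting U+FFFD when the input is malformed.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] smartQuote = {-109}; // 0x93 on its own is not valid UTF-8
        byte[] ascii = "abc".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValid(smartQuote, StandardCharsets.UTF_8)); // false
        System.out.println(isValid(ascii, StandardCharsets.UTF_8));      // true
    }
}
```

A check like this can help decide whether the data really is in the encoding you assumed before you display it.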

Simon G.