How to convert Strings to and from UTF8 byte arrays in Java

Question

In Java, I have a String and I want to encode it as a byte array (in UTF-8, or some other encoding). Alternatively, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?

score 445 · Accepted Answer · edited Apr 23 '22 at 07:56

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);

You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.
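To make the two directions concrete, here is a minimal round-trip sketch. The class name `Utf8RoundTrip` and the helper names `encode` and `decode` are made up for illustration; only `String.getBytes(Charset)`, `new String(byte[], Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // String -> UTF-8 bytes
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> String
    static String decode(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café": 4 chars, but the é needs 2 bytes in UTF-8
        byte[] encoded = encode(original);

        System.out.println(Arrays.toString(encoded)); // [99, 97, 102, -61, -87]
        System.out.println(original.length() + " chars, " + encoded.length + " bytes");
        System.out.println(decode(encoded).equals(original)); // true
    }
}
```

Note that the byte count and the char count differ as soon as the text contains non-ASCII characters, so never size a buffer from `String.length()`.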

edited Apr 23 '22 at 07:56

wovano

answered Sep 18 '08 at 00:16

mcherm

1

This method, however, will not report any problems in the conversion. This may be what you want. If not, it is recommended to use CharsetEncoder instead. – Michael Piefel Aug 17 '11 at 20:57
Why did you use `UTF-8` instead of `utf8` (which I always use) ? – Pacerier Jan 12 '12 at 10:54
7

@Pacerier because [the docs for Charset](http://docs.oracle.com/javase/6/docs/api/java/nio/charset/Charset.html) list "UTF-8" as one of the standard charsets. I believe that your spelling is also accepted, but I went with what the docs said. – mcherm Jan 17 '12 at 19:44
There is a problem when using two of these Strings: comparing them doesn't work – gal007 Feb 19 '13 at 14:50
26

Since JDK7 you can use StandardCharsets.UTF_8 https://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html#UTF_8 – Rafael Membrives Apr 15 '16 at 09:26
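Following up on Michael Piefel's comment about conversions that silently succeed: the sketch below shows one way to get an exception instead of a replacement character, using `CharsetEncoder` and `CharsetDecoder` with `CodingErrorAction.REPORT`. The class name `StrictUtf8` and the method names are assumptions for illustration, not part of any answer above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Encodes to UTF-8, throwing instead of substituting '?' for bad input
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    // Decodes UTF-8, throwing instead of substituting U+FFFD for bad input
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        try {
            // 0x81 is a continuation byte with no lead byte, so this is not valid UTF-8
            decodeStrict(new byte[] { 0, 0, 0, (byte) 0x81 });
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("invalid UTF-8 input: " + e);
        }
    }
}
```

The extra buffer handling is the price of the stricter behaviour; for trusted input the one-liners in the answer are usually enough.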

M. Leonhard · Answer 2 · 2010-08-03T05:02:09.170

101

Here's a solution that avoids performing the Charset lookup for every conversion:

import java.nio.charset.Charset;

// static, so the lookup happens once per class rather than once per instance
private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");

String decodeUTF8(byte[] bytes) {
    return new String(bytes, UTF8_CHARSET);
}

byte[] encodeUTF8(String string) {
    return string.getBytes(UTF8_CHARSET);
}

edited Aug 03 '10 at 05:02

answered Aug 02 '10 at 09:53

M. Leonhard

That's a good point... if performance is critical, then this would save a tiny amount of time. Only significant inside a very tight loop that isn't doing much else, but it could be helpful. – mcherm Aug 06 '10 at 15:39
4

@mcherm: Even if the performance difference is small, I prefer using objects (Charset, URL, etc) over their string forms when possible. – Bart van Heukelom Dec 07 '10 at 09:08
7

Note: "Since 1.6" public String(byte[] bytes, Charset charset) – leo Jan 20 '12 at 15:49
1

Regarding "avoids performing the Charset lookup for every conversion"... please cite some source. Isn't java.nio.charset.Charset built **on top** of String.getBytes and therefore has more overhead than String.getBytes? – Pacerier Jul 14 '12 at 22:43
2

The docs do state: "The behavior of this method when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required." – paiego Oct 19 '13 at 20:30
28

Note: since Java 1.7, you can use `StandardCharsets.UTF_8` for a constant way of accessing the UTF-8 charset. – Kat Jul 29 '14 at 23:27
bad there is no parameter to set offset/length for byte[] – user25 May 09 '18 at 21:47
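On user25's point about offset and length: the helper in the answer does not take them, but `String` itself has a constructor `String(byte[] bytes, int offset, int length, Charset charset)` that decodes a slice of a buffer. A small sketch follows; the class name `SliceDecode` and method name `decodeSlice` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class SliceDecode {
    // Decodes only buf[offset .. offset+length-1] as UTF-8
    static String decodeSlice(byte[] buf, int offset, int length) {
        return new String(buf, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] buf = "xxhelloyy".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeSlice(buf, 2, 5)); // hello
    }
}
```

The slice still has to start and end on character boundaries; cutting through a multibyte sequence produces replacement characters.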

score 19 · Answer 3 · edited Nov 18 '15 at 12:53

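String original = "hello world";
// getBytes(String) declares the checked UnsupportedEncodingException,
// so this must run where that exception is caught or declared
byte[] utf8Bytes = original.getBytes("UTF-8");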

edited Nov 18 '15 at 12:53

Marged

answered Sep 18 '08 at 00:13

Jorge Ferreira

Thanks! I wrote it up again myself adding the other direction of conversion. – mcherm Sep 18 '08 at 00:18
1

@smink The dash is not optional. This should use "UTF-8" – Mel Nicholson Jul 17 '13 at 21:50

score 15 · Answer 4 · answered Sep 18 '08 at 11:32

You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.

90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You would not incrementally decode using the String methods on arbitrary byte streams - you would leave yourself open to bugs involving multibyte characters.
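The multibyte pitfall can be reproduced in a few lines. This is only a sketch under assumed names (`ChunkedDecoding`, `decodeChunksNaively`, `decodeWithReader`): the first method decodes each chunk on its own and corrupts any character that straddles a chunk boundary, while the second lets an `InputStreamReader` carry partial sequences across reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ChunkedDecoding {
    // Buggy on purpose: each chunk is decoded independently, so a multibyte
    // character split across two chunks turns into replacement characters
    static String decodeChunksNaively(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // The Reader keeps decoder state between reads, so chunking is harmless
    static String decodeWithReader(byte[] data, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[chunkSize];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "\u00e9\u00e9"; // "éé" is 4 bytes in UTF-8: C3 A9 C3 A9
        byte[] data = original.getBytes(StandardCharsets.UTF_8);

        // A chunk size of 3 cuts the second é in half
        System.out.println(decodeChunksNaively(data, 3).equals(original)); // false
        System.out.println(decodeWithReader(data, 3).equals(original));    // true
    }
}
```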

answered Sep 18 '08 at 11:32

McDowell

Can you elaborate? If my application encodes and decodes Strings in `UTF-8`, what's the concern regarding multibyte characters? – raffian Dec 03 '13 at 03:45
@raffian Problems can occur if you don't transform all the character data in one go. See [here](http://illegalargumentexception.blogspot.co.uk/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_stringclass) for an example. – McDowell Dec 03 '13 at 09:00

score 14 · Answer 5 · answered Oct 19 '13 at 20:38

My Tomcat 7 setup accepts strings as ISO-8859-1, regardless of the content type of the HTTP request. The following solution worked for me when trying to correctly interpret characters like 'é'.

// szP1 is the string that was wrongly decoded as ISO-8859-1
byte[] b1 = szP1.getBytes("ISO-8859-1");
// java.util.Arrays; b1.toString() would not print the array contents
System.out.println(Arrays.toString(b1));

String szUT8 = new String(b1, "UTF-8");
System.out.println(szUT8);

When I tried to interpret the string as US-ASCII instead, the byte information was not preserved.

b1 = szP1.getBytes("US-ASCII");
System.out.println(Arrays.toString(b1));
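Here is a self-contained sketch of why the ISO-8859-1 trick works and the US-ASCII one does not. It simulates the server-side mistake rather than using Tomcat; the class name `MojibakeRepair` and the method `repair` are made up for the example. ISO-8859-1 maps every byte value to exactly one char, so re-encoding with it recovers the original bytes, whereas US-ASCII replaces every char above 127 with `?`.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Undoes a wrong ISO-8859-1 decode of bytes that were really UTF-8
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = "\u00e9".getBytes(StandardCharsets.UTF_8);       // "é" as sent: C3 A9
        String wrong = new String(wire, StandardCharsets.ISO_8859_1);  // two chars: U+00C3 U+00A9

        System.out.println(repair(wrong).equals("\u00e9")); // true

        // US-ASCII cannot represent U+00C3 or U+00A9, so both become '?'
        byte[] lossy = wrong.getBytes(StandardCharsets.US_ASCII);
        System.out.println(lossy[0] == '?' && lossy[1] == '?'); // true
    }
}
```

This only repairs text that was decoded with ISO-8859-1 specifically; fixing the container's request encoding is the cleaner long-term solution.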

answered Oct 19 '13 at 20:38

paiego

9

FYI, as of Java 7 you can use constants for those charset names such as [`StandardCharSets.UTF_8`](http://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html#UTF_8) and [`StandardCharSets.ISO_8859_1`](http://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html#ISO_8859_1). – Basil Bourque Jun 27 '14 at 23:20
Saved my day, working absolutely fine for the first solution mentioned above. – Hassan Jamil Apr 17 '18 at 08:11
Correction: it should be [StandardCharsets.UTF_8](http://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html#UTF_8) and [StandardCharsets.ISO_8859_1](http://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html#ISO_8859_1) (lowercase 's') – Thomas Mueller Nov 03 '22 at 08:00

score 9 · Answer 6 · answered May 11 '15 at 14:32

As an alternative, StringUtils from Apache Commons can be used.

 byte[] bytes = {(byte) 1};
 String convertedString = StringUtils.newStringUtf8(bytes);

or

 String myString = "example";
 byte[] convertedBytes = StringUtils.getBytesUtf8(myString);

If you have a non-standard charset, you can use getBytesUnchecked() or newString() accordingly.

answered May 11 '15 at 14:32

vtor

4

Note that this is StringUtils from **Commons Codec**, not Commons Lang. – Arend v. Reinersdorff Feb 29 '16 at 14:08
Yes, bit of a gotcha! For Gradle, Maven users: *"commons-codec:commons-codec:1.10"* (at time of writing). This also comes bundled as a dependency with Apache POI, for example. Apart from that Apache Commons to the rescue, as ever! – mike rodent Mar 03 '17 at 18:38

score 5 · Answer 7 · answered May 12 '15 at 18:10

I can't comment, and I don't want to start a new thread, but this isn't working for me. A simple round trip:

byte[] b = new byte[]{ 0, 0, 0, -127 };  // 0x00000081
String s = new String(b, StandardCharsets.UTF_8); // chars: 0x0000, 0x0000, 0x0000, 0xfffd
b = s.getBytes(StandardCharsets.UTF_8); // [0, 0, 0, -17, -65, -67] 0x000000efbfbd != 0x00000081

I need b[] to be the same array before and after the round trip, which it isn't (this refers to the first answer).
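The round trip fails because 0x81 on its own is not valid UTF-8 (it is a continuation byte with no lead byte), so the decoder substitutes U+FFFD and the original byte is gone. If the goal is to carry arbitrary binary data in a String and get the same bytes back, two options that are lossless for every byte value are Base64 and ISO-8859-1. The sketch below assumes that goal; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryRoundTrip {
    // Base64: the result is plain ASCII text, safe to store or transmit
    static String toBase64(byte[] b) {
        return Base64.getEncoder().encodeToString(b);
    }

    static byte[] fromBase64(String s) {
        return Base64.getDecoder().decode(s);
    }

    // ISO-8859-1 maps each byte 0x00..0xFF to the char with the same value
    static String toLatin1(byte[] b) {
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    static byte[] fromLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] b = new byte[] { 0, 0, 0, -127 };

        System.out.println(Arrays.equals(b, fromBase64(toBase64(b)))); // true
        System.out.println(Arrays.equals(b, fromLatin1(toLatin1(b)))); // true

        // For comparison, the UTF-8 round trip from the post above is lossy
        byte[] viaUtf8 = new String(b, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(b, viaUtf8)); // false
    }
}
```

Base64 is usually the safer choice when the string will pass through other systems, since the ISO-8859-1 string can contain control characters.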

score 3 · Answer 8 · answered Jul 01 '16 at 07:12

For decoding a series of bytes to a normal string message I finally got it working with UTF-8 encoding with this code:

/* Convert a list of UTF-8 byte values (as decimal strings) to a normal String.
 * Useful for decoding a JMS message that is delivered as a sequence of bytes instead of plain text.
 */
public String convertUtf8NumbersToString(String[] numbers){
    int length = numbers.length;
    byte[] data = new byte[length];

    for(int i = 0; i< length; i++){
        data[i] = Byte.parseByte(numbers[i]);
    }
    return new String(data, Charset.forName("UTF-8"));
}

Pacerier · Answer 9 · 2012-07-17T12:31:15.870

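If you are using 7-bit ASCII or ISO-8859-1 (an amazingly common format), then you don't have to create a new java.lang.String at all. It's much more performant to simply cast each byte to a char, masking with 0xFF first so that bytes above 127 are not sign-extended:

Full working example:

for (byte b : new byte[] { 43, 45, (byte) 215, (byte) 247 }) {
    // byte is signed in Java; without the mask, (byte) 215 would become U+FFD7 instead of U+00D7
    char c = (char) (b & 0xFF);
    System.out.print(c);
}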

If you are not using extended-characters like Ä, Æ, Å, Ç, Ï, Ê and can be sure that the only transmitted values are of the first 128 Unicode characters, then this code will also work for UTF-8 and extended ASCII (like cp-1252).

score 1 · Answer 10 · edited Mar 31 '16 at 22:13

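Charset UTF8_CHARSET = Charset.forName("UTF-8");
String strISO = "{\"name\":\"א\"}";
System.out.println(strISO);
// Encode with an explicit charset; the no-argument getBytes() uses the
// platform default, which may not match the UTF-8 decoding below
byte[] b = strISO.getBytes(UTF8_CHARSET);
for (byte c: b) {
    System.out.print("[" + c + "]");
}
String str = new String(b, UTF8_CHARSET);
System.out.println(str);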

edited Mar 31 '16 at 22:13

Debosmit Ray

answered Jan 15 '16 at 12:18

Nitish Raj

score 0 · Answer 11 · edited Mar 31 '16 at 22:15

Reader reader = new BufferedReader(
    new InputStreamReader(
        new ByteArrayInputStream(
            string.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));

edited Mar 31 '16 at 22:15

Debosmit Ray

answered May 12 '15 at 12:32

Макс Даниленко

score -1 · Answer 12 · answered Jul 01 '13 at 09:30

//query is your json   

 DefaultHttpClient httpClient = new DefaultHttpClient();
 HttpPost postRequest = new HttpPost("http://my.site/test/v1/product/search?qy=");

 StringEntity input = new StringEntity(query, "UTF-8");
 input.setContentType("application/json");
 postRequest.setEntity(input);   
 HttpResponse response = httpClient.execute(postRequest);

answered Jul 01 '13 at 09:30

Ran Adler

Does StringEntity convert 'query' to UTF-8, or does it just remember the charset for when the entity is attached? – SyntaxRules Oct 23 '13 at 03:39

score -10 · Answer 13 · answered Feb 19 '10 at 00:04

Terribly late, but I just encountered this issue and this is my fix:

private static String removeNonUtf8CompliantCharacters( final String inString ) {
    if (null == inString ) return null;
    byte[] byteArr = inString.getBytes();
    for ( int i=0; i < byteArr.length; i++ ) {
        byte ch= byteArr[i]; 
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        // (note: byte is signed in Java, so ch < 253 is always true and every byte >= 0x80 fails ch > 31)
        if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
            byteArr[i]=' ';
        }
    }
    return new String( byteArr );
}

First, it's not a conversion: it's the removal of non-printable bytes. Second, it assumes that the underlying OS' default encoding is really based on ASCII for printable characters (won't work on IBM Mainframes using EBCDIC, for instance). – Isaac Oct 19 '13 at 22:34
