Encode String to UTF-8

Question

I have a String with a "ñ" character and I have some problems with it. I need to encode this String as UTF-8. I tried it this way, but it doesn't work:

byte ptext[] = myString.getBytes();
String value = new String(ptext, "UTF-8");

How do I encode that string to utf-8?

It's unclear what exactly you're trying to do. Does myString correctly contain the ñ character and you have problems converting it to a byte array (in that case see answers from Peter and Amir), or is myString corrupted and you're trying to fix it (in that case, see answers from Joachim and me)? — Michael Borgwardt, Apr 20 '11 at 12:13
I need to send myString to a server with utf-8 encoding and I need to convert the "ñ" character to utf-8 encoding. — Alex, Apr 20 '11 at 12:20
Well, if that server expects UTF-8 then what you need to send it are bytes, not a String. So as per Peter's answer, specify the encoding in the first line and drop the second line. — Michael Borgwardt, Apr 20 '11 at 12:32
@Michael: I agree that it isn’t clear what the real intent is here. There seem to be a lot of questions where people are trying to do explicit conversions between Strings and bytes rather than letting the `{In,Out}putStream{Read,Writ}ers` do it for them. I wonder why? — tchrist, Apr 21 '11 at 15:05
@tchrist: my guess is that those questions are asked by people whose previous experience is with languages like C or PHP where a string is basically the same thing as a byte array and you have to track its encoding separately (and converting a string from one encoding to another one has meaning). — Michael Borgwardt, Apr 21 '11 at 15:20
@Michael: Thanks, I suppose that makes sense. But it also makes it harder than it needs to be, doesn’t it? I am not very fond of languages that work that way, and so try to avoid working with them. I think Java’s model of Strings of characters instead of bytes makes things a whole lot easier. Perl and Python also share the “everything is Unicode strings” model. Yes, in all three you can still get at bytes if you work at it, but in practice it seems rare that you truly need to: that’s quite low-level. Plus it feels kinda like brushing a cat the wrong direction, if you know what I mean. :) — tchrist, Apr 21 '11 at 15:24
@tchrist: I completely agree that a strong string abstraction is a very good thing. But C is from a time long before Unicode existed, when there was no single encoding that could represent all characters, and when *any* kind of abstraction over pure bytes would have been an intolerable performance penalty. Java programmers are lucky that it adopted Unicode relatively well from the beginning. Perl and Python are older and had Unicode support retrofitted, which makes it much less clean (explicit str/unicode duality in Python, nasty implicit UTF-8 flag in Perl). — Michael Borgwardt, Apr 21 '11 at 15:44
@Michael: The Python duality is pretty annoying; I am always forgetting `/u` in Python; same problem with PHP. With Perl 5.14, now in [RC1 testing](http://perlmonks.org/?node_id=900327), you can **finally** get [all Unicode strings](http://cpansearch.perl.org/src/JESSE/perl-5.14.0-RC1/pod/perldelta.pod). Perl regexes are still a lot more Unicode-friendly than Java’s, but I’ve been working with the [JDK7 people to fix that](http://old.nabble.com/%3Ci18n-dev%3E-Review-request%3A-7037261%3A-j.l.Character.isLowerCase-isUpperCase-need-to-match-the-Unicode-Standard-definition-td31437357.html). — tchrist, Apr 21 '11 at 16:01
possible duplicate of [How to convert Strings to and from UTF8 byte arrays in Java](http://stackoverflow.com/questions/88838/how-to-convert-strings-to-and-from-utf8-byte-arrays-in-java) — Paco Abato, Feb 03 '15 at 07:45
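
For the "send it to a server" case discussed in the comments above, here is a minimal sketch of letting a Writer do the encoding instead of converting by hand. The class name `Utf8WriterSketch`, the method `sendAsUtf8`, and the in-memory stream standing in for a real connection are assumptions made for illustration; they are not from the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8WriterSketch {

    // Writes text to any OutputStream as UTF-8; the Writer performs the encoding.
    static void sendAsUtf8(String text, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush(); // push the encoded bytes through without closing the caller's stream
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a socket or HTTP request body stream.
        ByteArrayOutputStream fakeServer = new ByteArrayOutputStream();
        sendAsUtf8("espa\u00f1ol", fakeServer); // \u00f1 is "ñ"
        System.out.println(fakeServer.size() + " bytes sent");
    }
}
```

The String itself is never "converted"; only the bytes that leave the program are UTF-8.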

score 189 · Answer 1 · edited Aug 11 '18 at 23:43

How about using

ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString);
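
If a byte[] is needed rather than a ByteBuffer, one possible sketch is to copy out only the encoded region of the buffer. The class and method names here are hypothetical, and the string literal is just an example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class EncodeToBufferSketch {

    // Copies only the encoded bytes (position..limit) out of the buffer,
    // rather than exposing the whole backing array via array().
    static byte[] toUtf8Bytes(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8Bytes("\u00f1"); // "ñ"
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

In most code `myString.getBytes(StandardCharsets.UTF_8)` gives the same bytes with less ceremony; the buffer form is mainly useful when the consumer already works with NIO buffers.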

edited Aug 11 '18 at 23:43

leventov

answered Apr 20 '11 at 11:57

Amir Rachum

See my discussion with Peter. But if his assumption about the question is right, your solution would still not be ideal since it returns a ByteBuffer. – Michael Borgwardt Apr 20 '11 at 12:09
9

But how do I obtain an encoded String? It returns a ByteBuffer – Alex Apr 20 '11 at 12:16
8

@Alex: it's *not possible* to have a UTF-8 encoded Java String. You want bytes, so either use the ByteBuffer directly (could even be the best solution if your goal is to send it via a network connection) or call array() on it to get a byte[] – Michael Borgwardt Apr 20 '11 at 12:35
Good one, short and to the point... Of course, it needs some additional steps: new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array()) – PhiLho Dec 18 '13 at 19:57
2

Something else that may be helpful is to use Guava's Charsets.UTF_8 constant instead of a String name that may throw an UnsupportedEncodingException. String -> bytes: `myString.getBytes(Charsets.UTF_8)`, and bytes -> String: `new String(myByteArray, Charsets.UTF_8)`. – laughing_man Mar 12 '14 at 03:24
25

Even better, use `StandardCharsets.UTF_8`. Available in Java 1.7+. – Kat Jul 29 '14 at 23:25
1

The array returned by `array()` will most likely be bigger than needed and padded, as it is the `ByteBuffer`'s internal array. Better to use `string.getBytes(StandardCharsets.UTF_8)`, which will return a new array with the correct size. – Chirlo Mar 31 '20 at 22:59

Joachim Sauer · Accepted Answer · 2022-03-22T08:27:22.900

156

String objects in Java use the UTF-16 encoding that can't be modified*.

The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).

* As a matter of implementation, String can internally use an ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).
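
To illustrate the point that an encoding only shows up once a String is turned into bytes, here is a small sketch. The class and helper names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class EncodingIsAboutBytesSketch {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String s = "\u00f1"; // "ñ", one char in the String regardless of any byte encoding
        System.out.println("chars:      " + s.length());
        System.out.println("UTF-8:      " + utf8Length(s) + " bytes");
        System.out.println("ISO-8859-1: " + latin1Length(s) + " bytes");
        System.out.println("UTF-16BE:   " + utf16Length(s) + " bytes");

        // Encoding and decoding with the same charset returns an equal String.
        String roundTrip = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println("round trip equal: " + roundTrip.equals(s));
    }
}
```

The same String yields different byte sequences depending on the charset chosen at conversion time, which is why "a UTF-8 String" is not a meaningful request in Java.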

edited Mar 22 '22 at 08:27

answered Apr 20 '11 at 11:58

Joachim Sauer

100

Technically speaking, byte[] doesn't have any encoding. Byte array PLUS encoding can give you string though. – Peter Štibraný Apr 20 '11 at 14:34
1

@Peter: true. But attaching an encoding to it only makes sense for `byte[]`, it doesn't make sense for `String` (unless the encoding is UTF-16, in which case it makes sense but it's still unnecessary information). – Joachim Sauer Apr 20 '11 at 14:36
4

`String objects in Java use the UTF-16 encoding that can't be modified.` Do you have an official source for this quote? – Ahmad Hajjar Oct 25 '18 at 02:21
1

@AhmadHajjar https://docs.oracle.com/javase/10/docs/api/java/lang/Character.html#unicode : "The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes." – Maxi Gis Oct 04 '19 at 14:43
Thanks to you and rzymek for your helpful answers! You both saved me time: you with the theoretical part and rzymek with the practical part. – Ruben Kubalyan Dec 05 '22 at 13:49

score 91 · Answer 3 · edited Apr 03 '17 at 17:29

In Java 7 you can use:

import static java.nio.charset.StandardCharsets.*;

byte[] ptext = myString.getBytes(ISO_8859_1); 
String value = new String(ptext, UTF_8);

This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException.
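
A short sketch of the difference between the two `getBytes` overloads mentioned above. The class and method names are hypothetical; both methods encode to UTF-8 so that only the exception handling differs.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class GetBytesOverloadsSketch {

    // Charset overload: no checked exception to handle.
    static byte[] encodeWithCharset(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // String-name overload: the compiler forces handling of UnsupportedEncodingException.
    static byte[] encodeWithName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00f1"; // "ñ"
        System.out.println(encodeWithCharset(text).length + " bytes via Charset overload");
        System.out.println(encodeWithName(text).length + " bytes via name overload");
    }
}
```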

If you're using an older Java version you can declare the charset constants yourself:

import java.nio.charset.Charset;

public class StandardCharsets {
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");
    //....
}

edited Apr 03 '17 at 17:29

Eduardo Cuomo

answered Nov 27 '13 at 12:52

rzymek

2

This is the right answer. If someone wants to use a string datatype, he can use it in the right format. Rest of the answers are pointing to the byte formatted type. – Neeraj Shukla Feb 08 '15 at 09:36
Works in 6. Thanks. – Itsik Mauyhas Sep 26 '17 at 12:26
Correct answer for me too. One thing though, when I used as above, German character changed to ?. So, I used this: byte[] ptext = myString.getBytes(UTF_8); String value = new String(ptext, UTF_8); This worked fine. – Farhan Hafeez Feb 12 '19 at 07:23
4

The code sample doesn't make sense. If you first convert to ISO-8859-1, then that array of byte is **not** UTF-8, so the next line is totally incorrect. It will work for ASCII strings, of course, but then you could as well make a simple copy: `String value = new String(myString);`. – Alexis Wilke Aug 16 '19 at 03:09

score 77 · Answer 4 · answered Apr 20 '11 at 11:57

Use byte[] ptext = myString.getBytes("UTF-8"); instead of getBytes(). getBytes() uses the so-called "default encoding", which may not be UTF-8.
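
To see what the platform default actually is on a given machine, something like the following sketch can help. The class and method names are assumptions for illustration, and the printed default charset will vary by JVM version and configuration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetSketch {

    // Always produces UTF-8 bytes, whatever the platform default is.
    static byte[] encodeExplicitly(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00f1"; // "ñ"
        System.out.println("platform default: " + Charset.defaultCharset());
        // The no-argument overload depends on the default printed above.
        System.out.println("getBytes() length: " + text.getBytes().length);
        System.out.println("getBytes(UTF_8) length: " + encodeExplicitly(text).length);
    }
}
```

If the two lengths differ, the default charset is not UTF-8 and the no-argument `getBytes()` would have sent different bytes.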

answered Apr 20 '11 at 11:57

Peter Štibraný

9

@Michael: he is clearly having trouble getting bytes from string. How is getBytes(encoding) missing the point? I think second line is there just to check if he can convert it back. – Peter Štibraný Apr 20 '11 at 12:01
1

I interpret it as having a broken String and trying to "fix" it by converting to bytes and back (common misunderstanding). There's no actual indication that the second line is just checking the result. – Michael Borgwardt Apr 20 '11 at 12:04
@Michael, no there isn't, it's just my interpretation. Yours is simply different. – Peter Štibraný Apr 20 '11 at 12:05
1

@Peter: you're right, we'd need clarification from Alex what he really means. Can't rescind the downvote though unless the answer is edited... – Michael Borgwardt Apr 20 '11 at 12:07

score 34 · Answer 5 · answered Apr 20 '11 at 11:58

A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.

So if you have an encoding problem, by the time you have String, it's too late to fix. You need to fix the place where you create that String from a file, DB or network connection.
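
As a sketch of "fixing the place where the String is created", the example below decodes the same bytes twice: once with the charset they were written in and once with a wrong one. The names are hypothetical, and the byte array stands in for whatever file, database, or network source supplies the data.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeAtSourceSketch {

    // Reads the first line of a stream, decoding the bytes with the given charset.
    static String readFirstLine(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // Pretend these bytes arrived from a UTF-8 source.
        byte[] wire = "ma\u00f1ana".getBytes(StandardCharsets.UTF_8);

        String right = readFirstLine(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String wrong = readFirstLine(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches original: " + right.equals("ma\u00f1ana"));
        System.out.println("decoded as ISO-8859-1 matches original: " + wrong.equals("ma\u00f1ana"));
    }
}
```

The fix belongs in the `InputStreamReader` line, not in later code that receives the already-damaged String.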

answered Apr 20 '11 at 11:58

Michael Borgwardt

1

It's a common mistake to believe that strings are internally encoded as UTF-16. Usually they are, but if so, it is only an implementation-specific detail of the String class. Since the internal storage of the character data is not accessible through the public API, a specific String implementation may decide to use any other encoding. – jarnbjo Apr 20 '11 at 12:45
5

@jarnbjo: The API explicitly states "A String represents a string in the UTF-16 format". Using anything else as internal format would be highly inefficient, and all actual implementations I know do use UTF-16 internally. So unless you can cite one that doesn't, you're engaging in pretty absurd hairsplitting. – Michael Borgwardt Apr 20 '11 at 13:30
Is it absurd to distinguish between public access and internal representation of data structures? – jarnbjo Apr 20 '11 at 15:01
1

@jarnbjo: so can you give an example for a JVM that does not internally represent Strings as UTF-16? – Michael Borgwardt Apr 20 '11 at 15:04
6

The JVM (as far as it is relevant to the VM at all) uses UTF-8 for string encoding, e.g. in the class files. The implementation of java.lang.String is decoupled from the JVM and I could easily implement the class for you using any other encoding for the internal representation if that is really necessary for you to realize that your answer is incorrect. Using UTF-16 as the internal format is in most cases highly inefficient as well when it comes to memory consumption and I don't see why e.g. Java implementations for embedded hardware wouldn't optimize for memory instead of performance. – jarnbjo Apr 20 '11 at 16:19
1

@jarnbjo: And once more: as long as you cannot give a concrete example of a JVM whose standard API implementation *does* internally use something other than UTF-16 to implement Strings, my statement is correct. And no, the String class is not really decoupled from the JVM, due to things like intern() and the constant pool. – Michael Borgwardt Apr 20 '11 at 18:25

score 25 · Answer 6 · edited Apr 20 '11 at 16:56

You can try this way.

byte ptext[] = myString.getBytes("ISO-8859-1"); 
String value = new String(ptext, "UTF-8");
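
A sketch of when this ISO-8859-1 to UTF-8 round trip does and does not help, in line with the objections raised elsewhere in this thread. It appears to be useful only if the String was already corrupted by decoding UTF-8 bytes as ISO-8859-1; applied to a correct String it damages the text. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepairSketch {

    // Reinterprets the chars of s as ISO-8859-1 bytes and decodes those bytes as UTF-8.
    static String reinterpretLatin1AsUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Ã±" is what "ñ" looks like after its UTF-8 bytes were decoded as ISO-8859-1.
        String corrupted = "\u00c3\u00b1";
        String repaired = reinterpretLatin1AsUtf8(corrupted);
        System.out.println("repaired equals ñ: " + repaired.equals("\u00f1"));

        // A String that was correct to begin with does not survive the same treatment:
        // the single byte 0xF1 is not valid UTF-8 and is replaced during decoding.
        String damaged = reinterpretLatin1AsUtf8("\u00f1");
        System.out.println("correct input still equals ñ: " + damaged.equals("\u00f1"));
    }
}
```

Even in the repair case this is a workaround; the cleaner fix is to decode with the right charset where the String is first created.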

edited Apr 20 '11 at 16:56

bstpierre

answered Apr 20 '11 at 12:24

user716840

1

I was going crazy. Thank you, getting the bytes in "ISO-8859-1" first was the solution. – Hanako Jun 19 '18 at 21:22
3

This is wrong. If your string includes Unicode characters, converting it to 8859-1 is going to throw an exception or worse give you an invalid string (maybe the string without those characters with code point 0x100 and over). – Alexis Wilke Aug 16 '19 at 03:22

Quimbo · Answer 7 · 2018-04-09T02:41:08.023

I ran into this problem at one point and managed to solve it in the following way.

First I needed to import

import java.nio.charset.Charset;

Then I had to declare constants for UTF-8 and ISO-8859-1

private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");

Then I could use it in the following way:

String textwithaccent="Thís ís a text with accent";
String textwithletter="Ñandú";

String text1 = new String(textwithaccent.getBytes(ISO), UTF_8);
String text2 = new String(textwithletter.getBytes(ISO), UTF_8);

score 9 · Answer 8 · answered Feb 19 '15 at 19:34

String value = new String(myString.getBytes("UTF-8"));

and, if you want to read from a text file encoded in "ISO-8859-1":

String line;
String f = "C:\\MyPath\\MyFile.txt";
try {
    BufferedReader br = Files.newBufferedReader(Paths.get(f), Charset.forName("ISO-8859-1"));
    while ((line = br.readLine()) != null) {
        System.out.println(new String(line.getBytes("UTF-8")));
    }
} catch (IOException ex) {
    //...
}
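
For comparison, here is a sketch of converting an ISO-8859-1 file to a UTF-8 file by naming the charset on both the read and the write. Once the lines are decoded correctly they are ordinary Strings, so no extra `getBytes` step should be needed in between. The class name, method name, and temporary files are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class TranscodeFileSketch {

    // Reads a file as ISO-8859-1 text and writes the same text as UTF-8.
    static void latin1ToUtf8(Path source, Path target) throws IOException {
        List<String> lines = Files.readAllLines(source, StandardCharsets.ISO_8859_1);
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for real input and output paths.
        Path source = Files.createTempFile("latin1-", ".txt");
        Path target = Files.createTempFile("utf8-", ".txt");
        try {
            Files.write(source, "a\u00f1o\n".getBytes(StandardCharsets.ISO_8859_1));
            latin1ToUtf8(source, target);
            System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(source);
            Files.deleteIfExists(target);
        }
    }
}
```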

score 3 · Answer 9 · answered May 04 '16 at 07:49

I used the code below to encode special characters by specifying the encoding.

String text = "This is an example é";
byte[] byteText = text.getBytes(Charset.forName("UTF-8"));
//To get original string from byte.
String originalString= new String(byteText , "UTF-8");

score 2 · Answer 10 · edited Jun 20 '20 at 09:12

A quick step-by-step guide on how to configure UTF-8 as the NetBeans default encoding. As a result, NetBeans will create all new files in UTF-8 encoding.

NetBeans default encoding UTF-8 step-by-step guide

1. Go to the etc folder in the NetBeans installation directory.
2. Edit the netbeans.conf file.
3. Find the netbeans_default_options line.
4. Add -J-Dfile.encoding=UTF-8 inside the quotation marks on that line (example: netbeans_default_options="-J-Dfile.encoding=UTF-8").
5. Restart NetBeans.

You have now set the NetBeans default encoding to UTF-8.

Your netbeans_default_options may contain additional parameters inside the quotation marks. In that case, add -J-Dfile.encoding=UTF-8 at the end of the string, separated from the other parameters by a space.

Example:

netbeans_default_options="-J-client -J-Xss128m -J-Xms256m -J-XX:PermSize=32m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true -J-Dsun.java2d.dpiaware=true -J-Dsun.zip.disableMemoryMapping=true -J-Dfile.encoding=UTF-8"

score 0 · Answer 11 · answered Dec 09 '14 at 07:48

This solved my problem

    String inputText = "some text with escaped chars";
    InputStream is = new ByteArrayInputStream(inputText.getBytes("UTF-8"));

answered Dec 09 '14 at 07:48

Prasanth RJ
