1

ElasticSearch is a search server that accepts data only in UTF-8.

When I try to give ElasticSearch the following text

"Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"

through my Java application (which takes this text from a web page and passes it to ElasticSearch), ES complains that it can't understand £ and fails. After filtering the text through the code below:

byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");

Here £ is converted to

But when I copy the text to a file in my home directory using bash, it goes in fine. Any pointers would help.

Chris Eberle
Vineeth Mohan
  • @VineethMohan why are you using `getBytes("ISO-8859-1")`? I thought you need to work in UTF-8? – buruzaemon Dec 16 '11 at 04:22
  • I need to identify the base encoding. I am assuming the encoding of the text is ISO-8859-1 – Vineeth Mohan Dec 16 '11 at 05:04
  • Does the page declare an encoding? What do the actual bytes look like? If the mystery character shows as 0xA3 then it's ISO-8859-1 or similar; in UTF-8 it's 0xC2 0xA3 – tripleee Dec 16 '11 at 06:47
  • Is there some way any character can be converted to UTF-8, something like escaping? – Vineeth Mohan Dec 16 '11 at 08:07
  • possible duplicate of [How do I convert between ISO-8859-1 and UTF-8 in Java?](http://stackoverflow.com/questions/652161/how-do-i-convert-between-iso-8859-1-and-utf-8-in-java) – brian d foy Dec 16 '11 at 10:33
  • The problem is in the code that reads the web page into a string. See http://stackoverflow.com/questions/6188901 – McDowell Dec 16 '11 at 10:41

3 Answers

3

You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it finds that a lone 0xA3 byte is not a legal UTF-8 sequence and replaces it with the substitution character.

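To convert correctly, you have to construct the string with the encoding the bytes actually use, then convert it to the encoding that you want. See [How do I convert between ISO-8859-1 and UTF-8 in Java?](http://stackoverflow.com/questions/652161/how-do-i-convert-between-iso-8859-1-and-utf-8-in-java).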
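As a minimal sketch of the point above (the class and method names are hypothetical, chosen only for illustration), the following reproduces the question's filter on a lone pound sign and then shows the decode-with-the-real-charset approach. It assumes the incoming bytes really are ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class PoundSignDemo {
    // Reproduces the question's filter: encode as ISO-8859-1, then decode as UTF-8.
    static String misdecode(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode the bytes with the charset they were written in, then encode as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String decoded = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String pound = "\u00A3";
        System.out.println(hex(pound.getBytes(StandardCharsets.ISO_8859_1))); // A3
        System.out.println(hex(pound.getBytes(StandardCharsets.UTF_8)));      // C2 A3
        System.out.println(misdecode(pound).equals("\uFFFD"));                // true
        System.out.println(hex(latin1ToUtf8(new byte[] {(byte) 0xA3})));      // C2 A3
    }
}
```

The byte values match the ones tripleee mentions in the comments: 0xA3 in ISO-8859-1, 0xC2 0xA3 in UTF-8.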

brian d foy
0

UTF-8 is easier than one thinks. In a String everything is Unicode characters; an encoding only matters when converting between bytes and Strings, which is done as follows. (Note that Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better use that one.)

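// Reading: decode Windows-1252 bytes into characters
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "Cp1252"));
// Writing: encode characters as UTF-8 bytes
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
// In a servlet: declare the encoding of the response
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
String s = "20 \u00A3"; // Escaping: \u00A3 is the pound sign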

To see why Cp1252 is more suitable than ISO-8859-1: http://en.wikipedia.org/wiki/Windows-1252
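A small sketch of the difference (the class name is hypothetical, and it assumes the running JDK ships the windows-1252 charset, which common JDK builds do although it is not one of the guaranteed `StandardCharsets`): bytes in the range 0x80 to 0x9F are printable characters in Cp1252 but control characters in ISO-8859-1, while the pound sign at 0xA3 decodes the same way in both.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Demo {
    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] euroByte = {(byte) 0x80};
        byte[] poundByte = {(byte) 0xA3};

        // 0x80 is the euro sign in Cp1252 but a C1 control character in ISO-8859-1.
        System.out.println(decode(euroByte, cp1252).equals("\u20AC"));                      // true
        System.out.println(decode(euroByte, StandardCharsets.ISO_8859_1).equals("\u0080")); // true

        // 0xA3 is the pound sign in both charsets.
        System.out.println(decode(poundByte, cp1252).equals("\u00A3"));                      // true
        System.out.println(decode(poundByte, StandardCharsets.ISO_8859_1).equals("\u00A3")); // true
    }
}
```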

Joop Eggen
-1

String s is a series of characters that are basically independent of any character encoding (ok, not exactly independent, but close enough for our needs now). Whatever encoding your data was in when you loaded it into a String, it has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS AN ERROR; do not ever use the system default encoding, trust me, I have over 10 years of experience in dealing with bugs related to wrong default encodings) or using the encoding you explicitly specified when you loaded the data.

When you call getBytes("ISO-8859-1") for a String, you request that the String is encoded into bytes according to ISO-8859-1 encoding.

When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. In your code you tell the String constructor that the byte array is encoded in UTF-8, but just above you encoded it in ISO-8859-1. That mismatch is your error.

What you want to do is:

byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");
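To illustrate the earlier point that the encoding has to be specified when the data is loaded, here is a hedged sketch (the class name, the `readAll` helper and the sample bytes are all hypothetical, and it assumes the page bytes are ISO-8859-1): decode once with the source charset while reading, and only encode to UTF-8 when bytes are needed again.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeOnceDemo {
    // Reads the whole stream, decoding bytes with an explicitly named charset
    // instead of relying on the system default.
    static String readAll(InputStream in, Charset sourceCharset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, sourceCharset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for bytes fetched from the web page: "£440" in ISO-8859-1.
        byte[] pageBytes = {(byte) 0xA3, '4', '4', '0'};

        String text = readAll(new ByteArrayInputStream(pageBytes), StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text.equals("\u00A3440")); // true
        System.out.println(utf8.length);              // 5
    }
}
```

Once the String holds the right characters, encoding it as UTF-8 and decoding it again as UTF-8 gives back the same String, so the round trip itself changes nothing.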
Torben