How to correctly read URL content with UTF-8 characters?

Question

    public class URLReader {
        public static byte[] read(String from, String to, String string) {
            try {
                String text = "http://translate.google.com/translate_a/t?"
                        + "client=o&text=" + URLEncoder.encode(string, "UTF-8")
                        + "&hl=en&sl=" + from + "&tl=" + to + "";

                URL url = new URL(text);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), "UTF-8"));
                String json = in.readLine();
                byte[] bytes = json.getBytes("UTF-8");
                in.close();
                return bytes;
                //return text.getBytes();
            } catch (Exception e) {
                return null;
            }
        }
    }

and:

    public class AbcServlet extends HttpServlet {
        public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            resp.setContentType("text/plain;charset=UTF-8");
            resp.getWriter().println(new String(URLReader.read("pl", "en", "koń")));
        }
    }

When I run this I get:

{"sentences"[{"trans":"end","orig":"koďż˝","translit":"","src_translit":""}],"src":"pl","server_time":30}

so UTF-8 is not handled correctly. But if I return the encoded URL, http://translate.google.com/translate_a/t?client=o&text=ko%C5%84&hl=en&sl=pl&tl=en, and paste it into the browser's address bar, I get the correct result:

{"sentences":[{"trans":"horse","orig":"koń","translit":"","src_translit":""}],"dict":[{"pos":"noun","terms":["horse"]}],"src":"pl","server_time":76}

gigadot · Accepted Answer · 2010-12-29T18:02:59.533

2

byte[] bytes = json.getBytes("UTF-8");

gives you a UTF-8 byte sequence, so URLReader.read also returns UTF-8 bytes.

But you decode those bytes without specifying a charset, i.e. new String(URLReader.read("pl", "en", "koń")), so Java decodes them with your platform default encoding (which is evidently not UTF-8).

Try:

new String(URLReader.read("pl", "en", "koń"), "UTF-8")
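
To see why the charset argument matters, here is a minimal standalone sketch. The class name CharsetDecodeDemo and its decode helper are hypothetical, and it assumes Java 7 or later for StandardCharsets. It decodes the UTF-8 bytes of the word once as UTF-8 and once as ISO-8859-1, which merely stands in for a non-UTF-8 platform default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeDemo {

    // Hypothetical helper: decode bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "ko\u0144" is the Polish word from the question; in UTF-8 its last
        // character takes two bytes (0xC5 0x84), so the array has 4 bytes.
        byte[] utf8Bytes = "ko\u0144".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        // ISO-8859-1 is an assumption standing in for a non-UTF-8 default.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("ko\u0144")); // true
        System.out.println(wrong.length());           // 4: every byte became its own character
    }
}
```

The second result is the same kind of corruption as in the question: the bytes are fine, but they were decoded with the wrong charset.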

Update

Here is the code that works fully on my machine:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

public class URLReader {

    public static byte[] read(String from, String to, String string) {
        try {
            String text = "http://translate.google.com/translate_a/t?"
                    + "client=o&text=" + URLEncoder.encode(string, "UTF-8")
                    + "&hl=en&sl=" + from + "&tl=" + to;
            URL url = new URL(text);
            URLConnection conn = url.openConnection();
            // Sending a browser-like User-Agent appears to avoid the 403 error.
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
            // Decode the response explicitly as UTF-8.
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String json = in.readLine();
            // Re-encode as UTF-8 so the caller receives UTF-8 bytes.
            byte[] bytes = json.getBytes("UTF-8");
            in.close();
            return bytes;
        } catch (Exception e) {
            System.out.println(e);
            // Be careful with returning null: a caller that uses the result
            // without a null check will throw a NullPointerException.
            return null;
        }
    }
}
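
Note that readLine() returns only the first line of the response. If the body can span several lines, a helper along these lines reads the whole stream with an explicit charset. This is only a sketch and not part of the code above: the class Utf8StreamReader and the method readAll are hypothetical names, the in-memory stream stands in for conn.getInputStream(), and it assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class Utf8StreamReader {

    // Hypothetical helper: read every character from the stream, decoding as UTF-8.
    static String readAll(InputStream input) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream).
        try (Reader reader = new InputStreamReader(input, StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for conn.getInputStream(): a two-line UTF-8 payload held in memory.
        byte[] payload = "{\"orig\":\"ko\u0144\"}\nsecond line".getBytes(StandardCharsets.UTF_8);
        String body = readAll(new ByteArrayInputStream(payload));
        System.out.println(body.contains("ko\u0144"));    // true
        System.out.println(body.endsWith("second line")); // true
    }
}
```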

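Don't forget to escape ń as \u0144. Unless javac is told the source file's encoding (for example with -encoding UTF-8), it may misread non-ASCII characters in string literals, so it is a good idea to write them as plain-ASCII escapes.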

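import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class AbcServlet extends HttpServlet {

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain;charset=UTF-8");
        // The bytes are already UTF-8, so write them out without decoding.
        byte[] read = URLReader.read("pl", "en", "ko\u0144");
        resp.getOutputStream().write(read);
    }
}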
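
As a quick sanity check that the escaped literal matches the URL from the question, the following sketch percent-encodes "ko\u0144" the same way URLReader.read does. The class name EscapeCheck and the helper encodeQueryValue are hypothetical and exist only for this illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EscapeCheck {

    // Hypothetical helper: percent-encode a query value as UTF-8,
    // using the same URLEncoder call as URLReader.read.
    static String encodeQueryValue(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // ASCII-only source: \u0144 is the last letter of the Polish word from the question.
        String word = "ko\u0144";
        System.out.println(word.length());          // 3
        System.out.println(encodeQueryValue(word)); // ko%C5%84
    }
}
```

The output ko%C5%84 is the same text parameter that appears in the URL pasted into the browser, so the request itself is encoded correctly and the corruption happens when the response bytes are decoded.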

edited Dec 29 '10 at 18:02

answered Dec 29 '10 at 15:14

gigadot


Hmm, now it returns {"sentences":[{"trans":"end","orig":"ko�","translit":"","src_translit":""}],"src":"pl","server_time":20} – Infinity Dec 29 '10 at 15:21
Is that from your web browser? Don't use PrintWriter when you are dealing with encoded bytes. The PrintWriter will use the JVM default encoding, which is not UTF-8. Try resp.getOutputStream().write((new String(URLReader.read("pl", "en", "koń"), "UTF-8")).getBytes("UTF-8")) – gigadot Dec 29 '10 at 15:32
Note that resp.setContentType("text/plain;charset=UTF-8"); does not really tell your servlet to encode the output as UTF-8. It simply informs the target web browser/client that you are going to send a stream of bytes encoded as UTF-8. The actual content encoding does not have to match the Content-Type header (though you surely don't want a mismatch). – gigadot Dec 29 '10 at 15:37
I don't need to write this out; I need to save the data correctly to a DB, but I don't see a good way to be certain it is right. – Infinity Dec 29 '10 at 15:40
I tried your code but got a 403 error from the Google server. It doesn't allow me to use its translator. – gigadot Dec 29 '10 at 16:14
Read from a UTF-8 text file directly. There is no way to store Unicode inside Java code reliably without escaping it. I don't know much about whether the Java compiler allows Unicode in the source or not, but it is safer not to use it. – gigadot Dec 29 '10 at 16:45
You only need to escape it if you want to use Unicode explicitly in Java code. If you get the input from a file, text field, etc., there is no need to escape it; you only need to make sure the correct encoding is used. I can see that you are making a prototype, which is why you hard-coded it, so I just mentioned it to let you know. – gigadot Dec 29 '10 at 17:00
I have made a slight change. You can write the bytes from URLReader.read to the OutputStream directly. – gigadot Dec 29 '10 at 18:04
