0

I am using the code below to fetch the whole HTML of a website, but non-ASCII characters do not come through. All I get is diamonds with a question mark.
A character like å shows up as �.
I doubt it is because of the charset, so what else could it be?

    Log.e("HTML", "fetching the HTML...");
    String url = "http://beep.tv2.dk";
    HttpClient client = new DefaultHttpClient();
    client.getParams().setParameter(CoreProtocolPNames.PROTOCOL_VERSION,
            HttpVersion.HTTP_1_1);
    client.getParams().setParameter(CoreProtocolPNames.HTTP_ELEMENT_CHARSET, "UTF-8");
    HttpGet request = new HttpGet(url);
    HttpResponse response = client.execute(request);
    String html = "";
    InputStream in = response.getEntity().getContent();
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    StringBuilder str = new StringBuilder();
    String line = null;
    while ((line = reader.readLine()) != null)
    {
        str.append(line);
    }
    in.close();
    html = str.toString();
Bozho

3 Answers

4

Thank you. This worked (in case others have the issue):

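    HttpClient client = new DefaultHttpClient();
    client.getParams().setParameter(CoreProtocolPNames.PROTOCOL_VERSION,
            HttpVersion.HTTP_1_1);
    client.getParams().setParameter(CoreProtocolPNames.HTTP_ELEMENT_CHARSET, "iso-8859-1");
    HttpGet request = new HttpGet(url);
    // Ask the server for ISO-8859-1 first
    request.setHeader("Accept-Charset", "iso-8859-1, unicode-1-1;q=0.8");
    HttpResponse response = client.execute(request);
    String html = "";
    InputStream in = response.getEntity().getContent();
    // Decode the body with an explicit charset instead of the platform default
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, "iso-8859-1"));
    // The rest is the same read loop as in the question
    StringBuilder str = new StringBuilder();
    String line = null;
    while ((line = reader.readLine()) != null)
    {
        str.append(line);
    }
    in.close();
    html = str.toString();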
Abdullah Gheith
  • You need to go back to the same PC/webbrowser from where you have opened the question and then register your original user account by OpenID. This way you will be able to use the same account from every other PC/webbrowser. – BalusC Dec 24 '10 at 13:41
2
  1. Use the InputStreamReader(in, "UTF-8") constructor, so the bytes are decoded with an explicit charset rather than the platform default.
  2. Set the Accept-Charset request header to, say, Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
  3. Make sure the page opens properly in a browser. If it does not, it might be a server-side issue.
  4. If none of the above works, check the other headers using Firebug (or a similar tool).
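As an illustration of steps 1 and 4, here is a minimal sketch that picks the decoding charset from a Content-Type header value instead of hard-coding it. The class name CharsetAwareReader and its two helper methods are hypothetical, not part of HttpClient, and the sketch assumes Java 7 or later (for StandardCharsets and try-with-resources). It works on a plain InputStream, so the stream from response.getEntity().getContent() could be passed in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class CharsetAwareReader {

    // Extracts the charset parameter from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback when the header is
    // null, has no charset parameter, or names a charset this JVM does not know.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers illegal and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Reads the whole stream, decoding bytes with the given charset.
    // Line breaks are normalized to '\n'.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

With this shape, the byte 0xE5 decodes to å when the header says ISO-8859-1, while decoding the same byte as UTF-8 yields the replacement character, which is the diamond with a question mark from the question.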
Bozho
1

This really helped me get started, but I was having the same problem while reading a text file. It was fixed using the following command:

    BufferedReader br = new BufferedReader(new InputStreamReader(new 
                FileInputStream(fileName), "iso-8859-1"));

...and of course, the HTTP Response needs to have the encoding set as well:

    response.setCharacterEncoding("UTF-8");
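The file-reading half can be tried in isolation. The following is a small sketch, assuming Java 7 or later; the class name FileCharsetDemo and the temporary file are made up for the demonstration. It writes a few ISO-8859-1 bytes to a file and reads them back with an explicit charset, using the same InputStreamReader approach as above.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class, for illustration only.
class FileCharsetDemo {

    // Returns the first line of the file, decoded with the given charset.
    static String firstLine(File file, Charset charset) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("charset-demo", ".txt");
        tmp.deleteOnExit();
        // "blåbær", written as ISO-8859-1 bytes (one byte per character)
        Files.write(tmp.toPath(),
                "bl\u00e5b\u00e6r".getBytes(StandardCharsets.ISO_8859_1));

        // Decoding with the charset the file was written in round-trips the text
        System.out.println(firstLine(tmp, StandardCharsets.ISO_8859_1));
        // Decoding the same bytes as UTF-8 produces replacement characters
        System.out.println(firstLine(tmp, StandardCharsets.UTF_8));
    }
}
```

What actually appears on the console also depends on the terminal's own encoding, so the printed output is best checked by comparing the strings in code rather than by eye.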

Thanks for the help!

Chris Clark