
I know this question has been asked many times, but I'm stuck on this problem and nothing I've read has helped me.

I have this code:

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while((line = reader.readLine()) != null)content += line+"\r\n";
reader.close();

I'm trying to get the content of this webpage http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/ and all non-Latin characters are displayed incorrectly.

I tried setting the encoding, like this:

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "WINDOWS-1251"));

and at that point everything worked! But I can't hardcode the encoding for every website I try to parse, so I need some general solution.

I know that detecting the encoding is not as easy as it seems, but I really need it. If you have run into this problem, please explain how you solved it!

Any help is appreciated!

This is the entire function I'm using to get the content:

protected Map<String, String> getFromUrl(String url){
    Map<String, String> mp = new HashMap<String, String>();
    String newCookie = "", redirect = null;
    try{
        String host = this.getHostName(url), content = "", header = "", UA = this.getUA(), cookie = this.getCookie(host, UA), referer = "http://"+host+"/";
        URL U = new URL(url);
        URLConnection conn = U.openConnection();
        conn.setRequestProperty("Host", host);
        conn.setRequestProperty("User-Agent", UA);
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3");
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
        conn.setRequestProperty("Accept-Charset", "utf-8;q=0.7,*;q=0.7");
        conn.setRequestProperty("Keep-Alive", "115");
        conn.setRequestProperty("Connection", "keep-alive");
        conn.setRequestProperty("Connection", "keep-alive");
        if(referer != null)conn.setRequestProperty("Referer", referer);
        if(cookie != null && !cookie.contentEquals(""))conn.setRequestProperty("Cookie", cookie);
        // Read response headers: collect cookies, detect redirects, build the raw header string
        for(int i = 0; ; i++){
            String name = conn.getHeaderFieldKey(i);
            String value = conn.getHeaderField(i);
            if(name == null && value == null)break;
            if(name != null){
                if(name.contentEquals("Set-Cookie"))newCookie += value + " ";
                else if(name.toLowerCase().trim().contentEquals("location"))redirect = value;
            }
            header += (name != null ? name + ": " : "") + value + "\r\n";
        }
        if(!newCookie.contentEquals("") && !newCookie.contentEquals(cookie))this.setCookie(host, UA, newCookie.trim());
        try{
            // NOTE: this reads with the platform default charset, which is where non-Latin text breaks
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while((line = reader.readLine()) != null)content += line+"\r\n";
            reader.close();
        }
        catch(Exception e){/*System.out.println(url+"\r\n"+e);*/}
        mp.put("url", url);
        mp.put("header", header);
        mp.put("content", content);
    }
    catch(Exception e){
        mp.put("url", "");
        mp.put("header", "");
        mp.put("content", "");
    }
    if(redirect != null && this.redirectCount < 3){
        mp = getFromUrl(redirect);
        this.redirectCount++;
    }
    return mp;
}

2 Answers


Use jsoup, for example. Detecting the character encoding of an arbitrary website is a complex issue because of lying or missing headers and two different kinds of meta tags. For example, the page you linked doesn't send a charset in its Content-Type header.

And you're going to need an HTML parser anyway; you weren't thinking of going with a regex, were you?

Here's example usage:

Connection connection = Jsoup.connect("http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/");
connection
    .header("Host", host)
    .header("User-Agent", UA)
    .header("Accept", "text/html,application/xhtml+xml,application/xmlq=0.9,*/*q=0.8")
    .header("Accept-Language", "ru-ru,ruq=0.8,en-usq=0.5,enq=0.3")
    .header("Accept-Encoding", "gzip,deflate")
    .header("Accept-Charset", "utf-8q=0.7,*q=0.7")
    .header("Keep-Alive", "115")
    .header("Connection", "keep-alive");

connection.followRedirects(true);

Document doc = connection.get();

Map<String, String> cookies = connection.response().cookies();

Elements titles = doc.select(".title");
for( Element title : titles ) {
    System.out.println(title.ownText());
}

Output:

Шины Marangoni E-COMM
Описание шины Marangoni E-COMM
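
If the goal is still to fill the same Map<String, String> that getFromUrl in the question returns, the jsoup result maps onto it directly. A minimal sketch, assuming the connection and doc variables from the snippet above and reusing the question's "url"/"header"/"content" keys:

// Sketch: feed the jsoup result back into the question's map (keys taken from getFromUrl above)
Connection.Response response = connection.response();

Map<String, String> mp = new HashMap<String, String>();
mp.put("url", response.url().toString());        // final URL after redirects
mp.put("header", response.headers().toString()); // response headers as a Map, stringified
mp.put("content", doc.outerHtml());              // HTML already decoded with the charset jsoup detected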
  • Actually, I'm already using jsoup, and I think I should use it for getting the content as well. The main reason I don't want to use it when fetching a website's content is that I'm trying to imitate a browser. I've just edited my post and added the exact function. – SuperYegorius Mar 29 '13 at 14:56
  • @SuperYegorius Why are you not using Jsoup for that? I have added code to imitate a browser using Jsoup. You are basically redoing, the hard way, everything jsoup already does :P – Esailija Mar 29 '13 at 15:09
  • @SuperYegorius Note that I left some things out, like setting request cookies, the referer, etc., but you can read the [documentation](http://jsoup.org/apidocs/) for how to set those. – Esailija Mar 29 '13 at 15:23

You want to look for the 'Content-Type' header:

Content-Type: text/html; charset=utf-8

The "charset" part there is what you're looking for.
