
I know this question has been asked many times, but I'm stuck on this problem and nothing I've read has helped me.

I have this code:

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while((line = reader.readLine()) != null)content += line+"\r\n";
reader.close();

I'm trying to get the content of this webpage http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/ and all non-Latin characters are displayed incorrectly.

I tried setting the encoding, like this:

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "WINDOWS-1251"));

and at that point everything worked! But I can't hardcode the encoding for every website I try to parse, so I need some general solution.

I know that detecting the encoding is not as easy as it seems, but I really need it. If you have run into this problem, please explain how you solved it!

Any help is appreciated!

This is the entire function I'm using to get the content:

protected Map<String, String> getFromUrl(String url){
    Map<String, String> mp = new HashMap<String, String>();
    String newCookie = "", redirect = null;
    try{
        String host = this.getHostName(url), content = "", header = "", UA = this.getUA(), cookie = this.getCookie(host, UA), referer = "http://"+host+"/";
        URL U = new URL(url);
        URLConnection conn = U.openConnection();
        conn.setRequestProperty("Host", host);
        conn.setRequestProperty("User-Agent", UA);
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3");
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
        conn.setRequestProperty("Accept-Charset", "utf-8;q=0.7,*;q=0.7");
        conn.setRequestProperty("Keep-Alive", "115");
        conn.setRequestProperty("Connection", "keep-alive");
        conn.setRequestProperty("Connection", "keep-alive");
        if(referer != null)conn.setRequestProperty("Referer", referer);
        if(cookie != null && !cookie.contentEquals(""))conn.setRequestProperty("Cookie", cookie);
        // Read response headers: collect cookies, detect redirects, build the raw header string
        for(int i = 0; ; i++){
            String name = conn.getHeaderFieldKey(i);
            String value = conn.getHeaderField(i);
            if(name == null && value == null)break;
            if(name != null){
                if(name.contentEquals("Set-Cookie"))newCookie += value + " ";
                else if(name.toLowerCase().trim().contentEquals("location"))redirect = value;
            }
            header += (name != null ? name + ": " : "") + value + "\r\n";
        }
        if(!newCookie.contentEquals("") && !newCookie.contentEquals(cookie))this.setCookie(host, UA, newCookie.trim());
        try{
            // NOTE: this reads with the platform default charset, which is where non-Latin text breaks
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while((line = reader.readLine()) != null)content += line+"\r\n";
            reader.close();
        }
        catch(Exception e){/*System.out.println(url+"\r\n"+e);*/}
        mp.put("url", url);
        mp.put("header", header);
        mp.put("content", content);
    }
    catch(Exception e){
        mp.put("url", "");
        mp.put("header", "");
        mp.put("content", "");
    }
    if(redirect != null && this.redirectCount < 3){
        mp = getFromUrl(redirect);
        this.redirectCount++;
    }
    return mp;
}

2 Answers


Use jsoup, for example. Detecting the character encoding of an arbitrary website is a complex issue because of lying or missing headers and two different kinds of meta tags. For example, the page you linked doesn't send a charset in its Content-Type header.

And you're going to need an HTML parser anyway; you weren't thinking of going with a regex, were you?

Here's example usage:

Connection connection = Jsoup.connect("http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/");
connection
    .header("Host", host)
    .header("User-Agent", UA)
    .header("Accept", "text/html,application/xhtml+xml,application/xmlq=0.9,*/*q=0.8")
    .header("Accept-Language", "ru-ru,ruq=0.8,en-usq=0.5,enq=0.3")
    .header("Accept-Encoding", "gzip,deflate")
    .header("Accept-Charset", "utf-8q=0.7,*q=0.7")
    .header("Keep-Alive", "115")
    .header("Connection", "keep-alive");

connection.followRedirects(true);

Document doc = connection.get();

Map<String, String> cookies = connection.response().cookies();

Elements titles = doc.select(".title");
for( Element title : titles ) {
    System.out.println(title.ownText());
}

Output:

Шины Marangoni E-COMM
Описание шины Marangoni E-COMM
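
If the goal is still to fill the same Map<String, String> that getFromUrl in the question returns, the jsoup result maps onto it directly. A minimal sketch, assuming the connection and doc variables from the snippet above and reusing the question's "url"/"header"/"content" keys:

// Sketch: feed the jsoup result back into the question's map (keys taken from getFromUrl above)
Connection.Response response = connection.response();

Map<String, String> mp = new HashMap<String, String>();
mp.put("url", response.url().toString());        // final URL after redirects
mp.put("header", response.headers().toString()); // response headers as a Map, stringified
mp.put("content", doc.outerHtml());              // HTML already decoded with the charset jsoup detected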
  • Actually, I'm already using jsoup, and I think I should use it for getting the content as well. The main reason I don't want to use it when fetching a website's content is that I'm trying to imitate a browser. I've just edited my post and added the exact function. – SuperYegorius Mar 29 '13 at 14:56
  • @SuperYegorius Why are you not using Jsoup for that? I have added code to imitate a browser using Jsoup. You are basically redoing, the hard way, everything jsoup already does :P – Esailija Mar 29 '13 at 15:09
  • @SuperYegorius Note that I left some things out, like setting request cookies, the referer, etc., but you can read the [documentation](http://jsoup.org/apidocs/) for how to set those. – Esailija Mar 29 '13 at 15:23

You want to look for the 'Content-Type' header:

Content-Type: text/html; charset=utf-8

The "charset" part there is what you're looking for.
