I'm trying to get the contents of a webpage as a string. I found this question on writing a basic web crawler, which claims to (and seems to) handle the encoding issue. However, the code provided there works for US/English websites but fails to handle other languages properly.
Here is a full Java class that demonstrates what I'm referring to:
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class I18NScraper
{
    static
    {
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

    //https://stackoverflow.com/questions/1381617/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");

    public static String getPageContentsFromURL(String page) throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);
            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(),charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
                int ch = r.read();
                if (ch < 0)
                    break;
                buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if(r != null){
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");

    public static String getDesc(String page){
        Matcher m = TITLE_PATTERN.matcher(page);
        if(m.find())
            return m.group(1);
        return page.contains("<title>")+"";
    }

    public static void main(String[] args) throws UnsupportedEncodingException, MalformedURLException, IOException{
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}
Which outputs:
??????????? — ??????: ??????? 360 ??? ???????
Though it ought to be:
Результатов — Яндекс: Нашлось 360 млн ответов
Can you help me understand what I'm doing wrong? Forcing the charset to UTF-8 does not help, even though that is the charset listed in both the page source and the HTTP header.
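To narrow down where the characters get lost, a diagnostic along the following lines might help. It is only a sketch, not a fix: the class `EncodingCheck` and the helpers `toCodePoints` and `charsetFromContentType` are names made up for this illustration, and the hard-coded title fragment stands in for the scraped string so the example runs without network access. The idea is to check two things separately: which charset is actually picked from the Content-Type header (the original pattern requires whitespace after the semicolon and is case-sensitive, so the ISO-8859-1 fallback may be taken silently), and whether the decoded string holds the right code points before it ever reaches the console.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingCheck {

    // Hypothetical, more lenient pattern: case-insensitive, whitespace optional,
    // tolerates a quoted value. Not the pattern from the linked question.
    private static final Pattern LENIENT_CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if none is found. */
    public static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback; // getContentType() may return null when the header is missing
        }
        Matcher m = LENIENT_CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : fallback;
    }

    /** Renders each code point as U+XXXX, which is plain ASCII and prints on any console. */
    public static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: stand-in for the first letters of the scraped title.
        String title = "\u0420\u0435\u0437";

        // Which charset would be chosen for a header without whitespace after ';'?
        System.out.println(charsetFromContentType("text/html;charset=UTF-8", "ISO-8859-1"));

        // What the JVM uses by default when System.out encodes characters.
        System.out.println("default charset: " + Charset.defaultCharset());

        // ASCII-only view of the string contents.
        System.out.println(toCodePoints(title));

        // Same string written through a stream that encodes as UTF-8 explicitly.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(title);
    }
}
```

If the code points printed for the real title are in the Cyrillic range (U+0400 to U+04FF), the page was decoded correctly and the question marks are introduced when the string is written to the console. If they are not, the problem is on the reading side and the charset chosen from the header is the first thing to check.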