1

I am trying to read HTML from Chinese websites and extract their <title> value. Websites encoded in UTF-8 work fine, but GB2312 websites do not (for example, m.39.net shows 39������_�й����ȵĽ����Ż���վ instead of 39健康网_中国领先的健康门户网站).

Here is the code I use to accomplish that:

URL url = new URL(urlstr);
URLConnection connection = url.openConnection();
inputStream = connection.getInputStream();
String content = IOUtils.toString(inputStream);
WoLfPwNeR

2 Answers

1

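String content = IOUtils.toString(inputStream, "GB2312"); should help. Without an explicit encoding, IOUtils.toString(inputStream) decodes the bytes with the platform default charset, which garbles GB2312 content.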
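
To illustrate why the explicit charset matters, here is a minimal sketch that uses only the JDK rather than Commons IO. The class name ExplicitCharsetDemo and the helper readAll are made up for this example, and an in-memory byte array stands in for connection.getInputStream(). It assumes the JDK in use ships the GB2312 charset, which full JDK distributions normally do. The title text is the first part of the site title from the question, written with Unicode escapes so the source compiles under any file encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Reads the whole stream and decodes the bytes with the given charset.
    // (Hypothetical helper; IOUtils.toString(inputStream, "GB2312") plays this role in the answer.)
    static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: GB2312 is available in this JDK.
        Charset gb2312 = Charset.forName("GB2312");

        // "39" followed by three Chinese characters, given as Unicode escapes.
        String title = "39\u5065\u5eb7\u7f51";

        // Stand-in for the bytes a GB2312 website would send.
        byte[] page = title.getBytes(gb2312);

        String right = readAll(new ByteArrayInputStream(page), gb2312);
        String wrong = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);

        System.out.println(right.equals(title)); // true
        System.out.println(wrong.equals(title)); // false
    }
}
```

Decoding the same bytes with the wrong charset is what produces the garbled title in the question; the bytes themselves are fine.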

If you want to detect the charset of a webpage, there are 3 ways as far as I know:

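  1. read the charset parameter of the Content-Type HTTP header via connection.getContentType() (note that connection.getContentEncoding() returns the Content-Encoding header, e.g. gzip, not the charset);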
  2. parse <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> or <meta charset="UTF-8"> in the HTML code (you have to download the HTML content first and then scan the first several lines);
  3. use a third-party library, e.g. those mentioned in this question.
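
The first two approaches can be sketched roughly as follows, using only the JDK. The class CharsetSniffer, its methods, the 1024-byte scan window, and the UTF-8 fallback are all assumptions made for this illustration. The regular expression is a simplification and not a full HTML or HTTP header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {

    // Matches charset=NAME in a Content-Type header value or in a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s\"';>/]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset name found in the text, or null if there is none.
    static String findCharsetName(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Tries the Content-Type header first, then the start of the HTML body.
    // Falls back to UTF-8 (an arbitrary choice for this sketch).
    static Charset detect(String contentTypeHeader, byte[] body) {
        String name = findCharsetName(contentTypeHeader);
        if (name == null) {
            // ISO-8859-1 maps every byte to a char, so ASCII markup such as
            // <meta charset=...> survives even if the real encoding differs.
            int len = Math.min(body.length, 1024);
            String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
            name = findCharsetName(head);
        }
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Malformed charset name: fall through to the default.
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] body = "<html><head><meta charset=\"ISO-8859-1\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        // No header charset, so the <meta> tag decides.
        System.out.println(detect(null, body));
    }
}
```

With a real connection, the header value would come from connection.getContentType(), and the detected charset would then be used to decode the downloaded bytes.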
xiGUAwanOU
0

Have you looked at the IOUtils Javadoc (http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html)? It has overloads that take the encoding explicitly, such as:

toString(byte[] input, String encoding)
sideshowbarker
xiaoming
  • Seems like the earlier answer at http://stackoverflow.com/a/34735065/441757 had already suggested using `IOUtils.toString()`... – sideshowbarker Jan 12 '16 at 03:27