Java GB2312 string in HTML does not display correctly

Question

I am trying to read HTML from Chinese websites and extract their <title> value. Websites encoded in UTF-8 work fine, but GB2312 websites do not (for example, m.39.net, which shows 39��_�й��ȵĽ��Ż��վ instead of 39健康网_中国领先的健康门户网站).

Here is the code I use to accomplish that:

URL url = new URL(urlstr);
URLConnection connection = url.openConnection();
InputStream inputStream = connection.getInputStream();
String content = IOUtils.toString(inputStream);

score 1 · Accepted Answer · edited May 23 '17 at 11:44

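Passing the charset explicitly should help: String content = IOUtils.toString(inputStream, "GB2312"); Without a charset argument, IOUtils.toString decodes the bytes with the platform default charset, which garbles GB2312 pages.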
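
As a rough illustration of why the explicit charset matters, the sketch below performs the same decoding step with JDK classes only, so it runs without Commons IO. The class name ExplicitCharsetDemo and the helper readAs are made up for this example, the in-memory stream stands in for connection.getInputStream(), and InputStream.readAllBytes() assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ExplicitCharsetDemo {
    // Hypothetical helper: reads the whole stream and decodes it with the named charset.
    static String readAs(InputStream in, String charsetName) {
        try {
            return new String(in.readAllBytes(), Charset.forName(charsetName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u56fd" is the two-character word for "China", encoded here as GB2312 bytes.
        byte[] raw = "<title>\u4e2d\u56fd</title>".getBytes(Charset.forName("GB2312"));

        // Decoding with the wrong charset produces garbage inside the title.
        System.out.println(readAs(new ByteArrayInputStream(raw), "ISO-8859-1"));

        // Decoding with the charset the page was written in restores the text.
        System.out.println(readAs(new ByteArrayInputStream(raw), "GB2312"));
    }
}
```

The same bytes yield different strings depending only on the charset passed to the decoder, which is the situation described in the question.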

If you want to detect the charset of a webpage, there are 3 ways as far as I know:

use connection.getContentType() and read the charset parameter of the Content-Type HTTP header (note that connection.getContentEncoding() returns the Content-Encoding header, such as gzip, not the charset);
parse <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> or <meta charset="UTF-8"> in the HTML code (you have to download the HTML content first and then read the first several lines);
use third-party libraries, e.g. those mentioned in this question.
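
The first two options could be sketched roughly as follows. This is only an illustration: the class CharsetSniffer, its method names, and the regular expressions are assumptions made for this example, and a regex is a simplification compared with a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches charset=... in a Content-Type header value, with optional quotes.
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches <meta charset="..."> as well as <meta http-equiv=... content="...;charset=...">.
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value, or null if there is none.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Returns the charset declared in a <meta> tag of the given markup, or null if none is found.
    static String fromMeta(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=GB2312")); // GB2312
        System.out.println(fromContentType("text/html"));                 // null
        System.out.println(fromMeta("<meta charset=\"UTF-8\">"));         // UTF-8
        System.out.println(fromMeta(
                "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\">")); // ISO-8859-1
    }
}
```

With a live connection, the header value would come from connection.getContentType(), which may be null when the server sends no Content-Type header.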

edited May 23 '17 at 11:44

Community

answered Jan 12 '16 at 03:14

xiGUAwanOU

This would work for the example site, but I need a generic way to determine the site title encoding. – WoLfPwNeR Jan 12 '16 at 19:23
@WoLfPwNeR I've already updated my answer, hope this could help. – xiGUAwanOU Jan 12 '16 at 21:27
Thanks! I was able to solve the problem using the first 2 bullet points and this post: http://stackoverflow.com/questions/9501237/read-stream-twice – WoLfPwNeR Jan 12 '16 at 22:09
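
One way the approach from the last comment might look: buffer the response bytes once, decode the start of the page as ISO-8859-1 (which maps every byte to exactly one character, so the ASCII markup stays readable) to find a <meta> charset, then decode the full buffer with the detected charset. The class TwoPassDecoder, the 4096-byte sniffing window, and the UTF-8 fallback are assumptions chosen for this sketch, not details taken from the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassDecoder {
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Decodes a fully buffered page; headerCharset is the charset from the HTTP header, or null.
    static String decode(byte[] body, String headerCharset) {
        String name = headerCharset;
        if (name == null) {
            // First pass: look only at the beginning of the page, decoded byte-for-byte.
            int headLength = Math.min(body.length, 4096);
            String head = new String(body, 0, headLength, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(head);
            if (m.find()) {
                name = m.group(1);
            }
        }
        // Second pass: decode the same bytes again with the charset that was found.
        return new String(body, resolve(name));
    }

    // Falls back to UTF-8 (an assumption) when the name is missing, illegal, or unsupported.
    private static Charset resolve(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a GB2312 page; "\u4e2d\u56fd" means "China".
        byte[] body = "<html><head><meta charset=\"gb2312\"><title>\u4e2d\u56fd</title></head></html>"
                .getBytes(Charset.forName("GB2312"));
        System.out.println(decode(body, null));
    }
}
```

With a real connection, the byte array could come from connection.getInputStream().readAllBytes() on Java 9 or later; holding the bytes in memory is what makes it possible to read them twice.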

score 0 · Answer 2 · edited Jan 12 '16 at 03:25

Have you seen the IOUtils documentation at http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html? It has an overload that takes an encoding:

toString(byte[] input, String encoding)

edited Jan 12 '16 at 03:25

sideshowbarker

answered Jan 12 '16 at 03:23

xiaoming

Seems like the earlier answer at http://stackoverflow.com/a/34735065/441757 had already suggested using `IOUtils.toString()`... – sideshowbarker Jan 12 '16 at 03:27
