How to find out the character encoding of a web page using Java
2 Answers
2
Open a connection to the URL (using URL.openConnection()), and then parse the content type returned by the getContentType() method (which should contain the charset). If the charset is not present in this header, you might have to parse the HTML content and look for a tag such as
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
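
The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a robust detector: the class `CharsetSniffer` and its method names are hypothetical, the regular expression is a simplification of real HTML parsing, and reading only the first 4096 bytes is an arbitrary assumption. `readNBytes` requires Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class CharsetSniffer {

    // Matches "charset=..." in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s\"';,/>]+)", Pattern.CASE_INSENSITIVE);

    // Matches a single <meta ...> tag.
    private static final Pattern META_TAG =
            Pattern.compile("<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if there is none. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset named in the first <meta> tag that declares one, or null. */
    static String fromMetaTag(String html) {
        Matcher meta = META_TAG.matcher(html);
        while (meta.find()) {
            String charset = fromContentType(meta.group());
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    /** Tries the HTTP Content-Type header first, then the start of the document. */
    static String detect(String urlString) throws IOException {
        URLConnection conn = URI.create(urlString).toURL().openConnection();
        String charset = fromContentType(conn.getContentType());
        if (charset != null) {
            return charset;
        }
        try (InputStream in = conn.getInputStream()) {
            byte[] head = in.readNBytes(4096); // assumption: the <meta> tag is near the top
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
            return fromMetaTag(new String(head, StandardCharsets.ISO_8859_1));
        }
    }
}
```

A call such as `CharsetSniffer.detect("http://example.com/")` would return the declared charset name, or `null` if neither the header nor a `<meta>` tag declares one.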

JB Nizet
- I would change "might have to" to "will have to". – Lucas Zamboulis Feb 22 '11 at 12:14
- You should also look at the XML declaration, like `<?xml version="1.0" encoding="UTF-8"?>`. (If existent, it should be right at the beginning of the document.) – Paŭlo Ebermann Feb 22 '11 at 13:43
1
I believe this does exactly what you need. Has both code and explanation. http://nadeausoftware.com/node/73
A quick summary is as follows:
Create a WebFile class where:
- The constructor `public WebFile( String urlString )` opens a `URLConnection` and reads in the headers, including the character encoding. If the encoding is not present, then you'll have to read the encoding from the web page itself. If this is not present either, you could try your luck with a character encoding detection algorithm.
- The method `private Object readStream(int length, java.io.InputStream stream)` reads the page data from the stream and returns a `String` using the character encoding, i.e. `return new String( bytes, charset )`, or returns the byte array created by reading the stream if there is no encoding present or if there's an encoding exception.
- You have getters and setters for the page content (e.g. one that invokes `readStream` just once, one that returns the encoding).
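
To illustrate the "String if possible, raw bytes otherwise" behaviour described for `readStream`, here is a simplified sketch. It is not the code from the linked page: the class name `PageDecoder`, the `decode` helper, and the changed signature (the charset is passed in and the `length` parameter is dropped) are assumptions made for the example. `readAllBytes` requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper; not the WebFile class from the linked article.
final class PageDecoder {

    /**
     * Decodes the bytes with the named charset. Returns the raw byte array
     * instead when no charset is known or the charset name is not supported.
     */
    static Object decode(byte[] bytes, String charset) {
        if (charset == null) {
            return bytes;
        }
        try {
            return new String(bytes, charset);
        } catch (UnsupportedEncodingException e) {
            return bytes; // unknown or invalid charset name: fall back to raw bytes
        }
    }

    /** Reads the whole stream, then decodes it as described above. */
    static Object readStream(InputStream stream, String charset) throws IOException {
        return decode(stream.readAllBytes(), charset);
    }
}
```

Returning `Object` mirrors the signature quoted in the summary; callers would check with `instanceof` whether they received a `String` or a `byte[]`.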


Lucas Zamboulis
- Providing *only* a link to an external resource is not a good answer. The link can go invalid and become useless. You should have at *least* a summary in your answer. – Joachim Sauer Feb 22 '11 at 12:15
- @Joachim Sauer: didn't want to rewrite the perfectly good description of that page - but didn't think about the invalid link scenario. Fixed, thanks. – Lucas Zamboulis Feb 22 '11 at 12:32