I searched other Stack Overflow questions before posting here and didn't find anything similar. I have to scrape several UTF-8 web pages that contain text like
"Oggi è una bellissima giornata"
The problem is the character "è".
I extract this text with JTidy and an XPath query expression, and I convert it with
byte[] content = filteredEncodedString.getBytes("utf-8");
String result = new String(content,"utf-8");
where filteredEncodedString contains the text "Oggi è una bellissima giornata". This procedure works on most of the web pages analyzed so far, but in some cases it doesn't produce a correct UTF-8 string. The page encoding is always the same and the text is similar.
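A minimal sketch of why this conversion step cannot repair or damage the text by itself: encoding a String to UTF-8 and decoding the same bytes as UTF-8 gives back an equal String for any well-formed text, so if "è" is already wrong in filteredEncodedString it stays wrong. The class and method names are made up for illustration, and the ISO-8859-1 mis-decoding is only an assumed example of how the character could have been damaged at an earlier step.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Same conversion as in the question: encode to UTF-8, decode as UTF-8.
    static String roundTrip(String s) {
        byte[] content = s.getBytes(StandardCharsets.UTF_8);
        return new String(content, StandardCharsets.UTF_8);
    }

    // Assumed failure mode: UTF-8 bytes decoded with the wrong charset
    // somewhere before the String reaches the conversion above.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "Oggi \u00e8 una bellissima giornata"; // \u00e8 is "è"
        String broken = misdecode(text);

        System.out.println(roundTrip(text).equals(text));     // true: nothing changes
        System.out.println(broken.equals(text));              // false: "è" became two characters
        System.out.println(roundTrip(broken).equals(broken)); // true: the damage is preserved
    }
}
```

If this is what happens, the place to look is wherever the page bytes are first turned into characters, not this conversion.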
Edit on September 14th
I modified my code as follows to fetch the pages in UTF-8 encoding:
URL url = new URL(currentUrl);
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
// Decode the response bytes with the page's charset (getEncode() returns "utf-8" here).
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), getEncode()));
StringBuilder domString = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    // readLine() strips the line terminator, so put one back between lines.
    domString.append(line).append('\n');
}
in.close();
// Re-encode the decoded text as UTF-8 bytes for the caller.
return domString.toString().getBytes("UTF-8");
//return text.getBytes();
where getEncode() returns the page encoding, UTF-8 in this case. But I still notice that characters such as ì or é are not read correctly. Is there something wrong with this code? Thanks again!
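The implementation of getEncode() is not shown here. As one possible way to obtain the page encoding, the sketch below reads the charset parameter from a Content-Type header value such as the one returned by URLConnection.getContentType(). The class name, the method name charsetFromContentType and the fallback parameter are hypothetical and only serve the illustration; pages that declare their encoding only in a meta tag would need additional handling.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class CharsetSniffer {
    // Hypothetical helper: extracts the charset from a header value such as
    // "text/html; charset=utf-8" and returns the fallback when none is usable.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With the connection from the code above, it could be called as charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8), and the result passed to the InputStreamReader constructor that accepts a Charset.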
Edited on October 2nd
This code seems to work. The problem was in the creation of a DOM Document, which I hadn't posted (sorry about this!), from the bytes returned by the method above.
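The DOM creation code itself was not posted, so the following is only a sketch of the general idea: when a Document is built from a byte array, tell the parser which charset the bytes are in rather than relying on a default. It uses the standard JAXP parser as a stand-in (not JTidy), so it assumes well-formed XHTML input; the class name DomFromBytes and the method parse are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomFromBytes {
    // Builds a DOM Document from raw bytes, stating the byte encoding explicitly.
    static Document parse(byte[] bytes, String encoding) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            InputSource source = new InputSource(new ByteArrayInputStream(bytes));
            source.setEncoding(encoding); // charset of the bytes, e.g. "UTF-8"
            return builder.parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Oggi \u00e8 una bellissima giornata</p></body></html>";
        byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8); // same encoding as the method above
        Document doc = parse(bytes, "UTF-8");
        System.out.println(doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

The point is that the encoding used to produce the bytes and the encoding given to the parser have to match.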