
I searched other Stack Overflow questions before posting here and didn't find anything similar. I have to scrape various UTF-8 web pages which contain text like

"Oggi è una bellissima giornata"

The problem is with the character "è".

I extract this text with JTidy and an XPath query expression, and I convert it with

// re-encode the extracted text as UTF-8 bytes, then immediately decode those bytes back into a String
byte[] content = filteredEncodedString.getBytes("utf-8");
String result = new String(content, "utf-8");

where filteredEncodedString contains the text "Oggi è una bellissima giornata". This procedure works on most of the web pages analyzed so far, but in some cases it doesn't extract a UTF-8 string. The page encoding is always the same, and the text is similar.
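One quick check is to look at what encoding the server actually declares for the page (a minimal sketch using the Content-Type header; a page may still override this in a meta tag):

// sketch: print the encoding the server declares for the page
URLConnection conn = new URL(currentUrl).openConnection();
String contentType = conn.getContentType();   // e.g. "text/html; charset=UTF-8"
System.out.println(contentType);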

Edit on September 14th

I modified my code as follows to get pages in UTF-8 encoding:

URL url = new URL(currentUrl);
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");

// decode the response using the page's own encoding (UTF-8 here)
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), getEncode()));

// read the whole document into one string (readLine() drops the line separators)
StringBuilder domString = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    domString.append(line);
}
in.close();

// re-encode the decoded document as UTF-8 bytes for the caller
byte[] bytes = domString.toString().getBytes("UTF-8");

return bytes;

where getEncode() returns the page encoding, UTF-8 in this case. But I still notice that ì or é are not read correctly. Is there something wrong with this code? Thanks again!
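One way to check whether the bytes coming from the server really are UTF-8 is to dump the start of the raw response in hex, before any character decoding (a diagnostic sketch only, not part of the scraper); "è" should appear as the two bytes C3 A8, not a single E8:

// diagnostic sketch: print the first raw bytes of the page in hex
InputStream raw = new URL(currentUrl).openStream();
byte[] buf = new byte[512];
int n = raw.read(buf);
for (int i = 0; i < n; i++) {
    System.out.printf("%02X ", buf[i] & 0xFF);
}
raw.close();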

Edit on October 2nd

This code seems to work. The problem was in a DOM Document creation step I didn't post (sorry about this!), which uses the bytes returned by the method above.
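For reference, a minimal sketch of how those bytes can be fed to a DOM parser so that it knows they are UTF-8 (my actual code goes through JTidy, which I haven't posted; this is only an illustration, with 'bytes' being the array returned by the method above):

// illustrative sketch: parse the UTF-8 bytes returned by the method above
InputSource source = new InputSource(new ByteArrayInputStream(bytes));
source.setEncoding("UTF-8"); // tell the parser explicitly how the bytes are encoded
Document dom = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(source);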

RAS

2 Answers


You cannot "convert" a String to UTF-8 after the fact. If the bytes have been converted to chars incorrectly, then you have already lost the data.
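For example (a minimal sketch, with ISO-8859-1 standing in for whichever wrong charset did the first decode):

byte[] pageBytes = "Oggi è una bellissima giornata".getBytes("UTF-8");

// first decode done with the wrong charset -> mojibake
String wrong = new String(pageBytes, "ISO-8859-1");

// re-encoding and re-decoding as UTF-8 afterwards changes nothing
String attempt = new String(wrong.getBytes("UTF-8"), "UTF-8");
System.out.println(attempt.equals(wrong)); // true: the UTF-8 round trip repaired nothing

// the charset has to be right at the very first decode
System.out.println(new String(pageBytes, "UTF-8")); // Oggi è una bellissima giornata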

jtahlborn
  • unless you used a charset (like `ISO-8859-1`) that is a one-to-one mapping from bytes to chars. Most default charsets (`windows-12xx`) don't fall into that category though. – mihi Sep 11 '12 at 16:15
  • @mihi - yes, there are rare occasions where it might work, but in the general case, by the time you get the chars, you have already lost. – jtahlborn Sep 11 '12 at 16:28
  • sure, but in case you don't have control over the library that converts to chars, you can often set system properties like `file.encoding` to "friendly" charsets as a workaround until the library gets fixed (if ever) – mihi Sep 12 '12 at 17:26
  • @mihi - heh, good point. (as long as you don't have conflicting needs for the default encoding :) ) – jtahlborn Sep 12 '12 at 18:32
  • I updated the code according to your suggestions, but it still doesn't work. – Marco Piccinni Sep 14 '12 at 07:21
  • @MarcoPiccinni - i would recommend reading the url content as _bytes_ and see if the characters are encoded as utf-8 in the first place. – jtahlborn Sep 14 '12 at 12:08

You can try to get your page as an array of bytes, not as a string, and then convert it with StringUtils to a UTF-8 string.
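For instance (a rough sketch; the final line is what a StringUtils-style helper, e.g. Commons Codec's StringUtils.newStringUtf8, would do for you):

// rough sketch: read the page as raw bytes first...
ByteArrayOutputStream out = new ByteArrayOutputStream();
InputStream in = new URL(currentUrl).openStream();
byte[] buf = new byte[4096];
int n;
while ((n = in.read(buf)) != -1) {
    out.write(buf, 0, n);
}
in.close();

// ...then decode all of them as UTF-8 in a single step
String page = new String(out.toByteArray(), "UTF-8");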

svz
  • Thanks, I did that, but maybe there is something wrong with some pages, since most of them are parsed correctly even with special characters – Marco Piccinni Sep 12 '12 at 10:35
  • Well, it's hard to say anything without seeing these pages and the parse results. Are you sure that **all** pages are in UTF-8? – svz Sep 12 '12 at 10:45
  • You're right, but the pages are all in UTF-8 encoding, and the result is that "è", for example, is replaced with A'' or something similar – Marco Piccinni Sep 12 '12 at 15:16
  • I updated the code according to your suggestions, but it still doesn't work. Any ideas? – Marco Piccinni Sep 14 '12 at 07:22
  • Well, I'm nearly out of ideas. You can take a look at [this](http://stackoverflow.com/questions/8934797/java-utf-8-encoding-not-set-to-urlconnection) question. Maybe you'll find something useful there. – svz Sep 14 '12 at 07:40