
I have some text encoded in ISO-8859-1 which I then extract some data from using Regex.

The problem is that the strings I get from the matcher object are in the wrong encoding, scrambling chars like "ÅÄÖ".

How do I stop the regex library from scrambling my chars?

Edit: Here's some code:

private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}
private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];

    while ((read = input.read(tmp))!=-1)
    {
        builder.append(new String(tmp), 0,read-1);
    }

    return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff
monoceres
  • Strings in Java are always UTF-16, so no encoding issues there. How do you get your data in the string in the first place? I.e. how exactly do you convert from the legacy encoding? – Joey Aug 07 '10 at 16:46
  • It's html from a website. The website specified ISO-8859-1 in the head tag so I just assumed it was stored in that format as well. – monoceres Aug 07 '10 at 17:05
  • Then I would assume you already got the correct characters – or not. But in any case, I'm fairly sure the regex isn't what breaks it here. If you could provide some code showing exactly what you're trying, that may help, as would details on how exactly your strings get scrambled. – Joey Aug 07 '10 at 17:25
  • The correct way to specify the encoding of a web page is through an HTTP header. That declaration, if present, supersedes anything found within the page itself, like a `<meta>` tag in the `<head>` element. Does the page display correctly in your browser? If so, what does your browser say the page's encoding is? If you still need help, please provide the information @Johannes requested. – Alan Moore Aug 07 '10 at 23:15
  • The root cause of this particular problem is that he read the HTML output using the wrong encoding (or the stdout/whatever log-console is using the wrong encoding to display results) and that he thought it's caused by regex. Using an encoding-aware HTML parser will fix the problem and more. Extracting HTML using regex in turn is never going to be reliably possible in an easy manner. – BalusC Aug 07 '10 at 23:39
  • The HTTP headers also specify ISO-8859-1, and yes, the page displays correctly in the browser. See my comment on BalusC's answer for more info on the problem. – monoceres Aug 08 '10 at 01:07

2 Answers


This is probably the immediate cause of your problem, and it's definitely an error:

builder.append(new String(tmp), 0, read-1);

When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
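To illustrate the difference, here is a minimal sketch that decodes the same three ISO-8859-1 bytes twice: once with the charset named explicitly, and once as UTF-8 to simulate a platform whose default is not ISO-8859-1. The class and method names (CharsetDemo, decodeLatin1) are made up for this example, and StandardCharsets assumes Java 7 or later; on older runtimes Charset.forName("ISO-8859-1") should do the same job.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: decodes raw bytes with an explicitly named charset
    // instead of relying on the platform default.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The characters Å, Ä, Ö as ISO-8859-1 bytes (one byte per character).
        byte[] raw = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};

        String right = decodeLatin1(raw);
        // Assumption for the demo: the platform default happens to be UTF-8.
        String wrong = new String(raw, StandardCharsets.UTF_8);

        // Unicode escapes keep this source file independent of its own encoding.
        System.out.println(right.equals("\u00C5\u00C4\u00D6")); // true
        System.out.println(wrong.equals(right));                // false
    }
}
```

The bytes are not valid UTF-8, so the second decode yields replacement characters rather than the original letters, which is the kind of scrambling described in the question.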

But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
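As a rough sketch of the reader-based approach (not a drop-in replacement for your code): the method below wraps the stream in an InputStreamReader with an explicit charset and reads chars instead of bytes, so the decoder handles characters that straddle buffer boundaries. ResponseReader and readAll are names invented for this example, and the charset is passed in by the caller on the assumption that you have already taken it from the response headers. Note that the third argument to append here is a length, so no read-1 is needed; with append(CharSequence, start, end), read-1 would drop the last character of each chunk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseReader {
    // Hypothetical helper: reads the whole stream as text, decoding with the
    // given charset rather than the platform default.
    static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(input, charset)) {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                // append(char[], offset, length): copy exactly the chars just read.
                builder.append(buffer, 0, read);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for response.getEntity().getContent(): Å, Ä, Ö in ISO-8859-1.
        byte[] raw = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};
        String text = readAll(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("\u00C5\u00C4\u00D6"));
    }
}
```

In your getResponseBody, the call would look something like readAll(response.getEntity().getContent(), Charset.forName(name)), where name is whatever charset the Content-Type header reports.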

In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.

Alan Moore
  • +1. Nice one, didn't even know about this. So even with platforms that normally use Unicode only there is some margin of serious screwup. – Joey Aug 11 '10 at 11:46

It's html from a website.

Use an HTML parser and this problem and all future potential problems will disappear.

I can recommend picking Jsoup for the job.


BalusC
  • I should have mentioned it from the beginning, but I thought it was irrelevant: the platform is Android, which makes using standard third-party Java libraries more difficult. And yes, I know that parsing HTML with regex is bad, and I will probably move on to an HTML parser that I can use with Android. – monoceres Aug 08 '10 at 01:10