I'm downloading a website's source code using HttpClient and then want to extract some data with regular expressions. Unfortunately, the website is encoded in ISO-8859-1, which seems to be causing problems. Here's the sample code that downloads the page:

HttpGet query = new HttpGet(url);
HttpResponse queryResponse = httpClient.execute(query);
String queryText = EntityUtils.toString(queryResponse.getEntity()).replaceAll("\r", " ").replaceAll("\n", " ");

And then the expression:

Pattern pattern = Pattern.compile("<p class=\"qt\">(.*?)</p>");
Matcher matcher = pattern.matcher(queryText);
while (matcher.find()) { /* do something with matcher.group(1) */ }

The problem is that it misses some occurrences when the text contains special ISO-8859-1 characters: (.*?) doesn't seem to match them. What is causing this, and how do I fix it?

nhahtdh
Sebastian Nowak
  • That whole "I want to use regex" is the first mistake; would you consider just using something like [jsoup](http://jsoup.org/) or [tagsoup](http://ccil.org/~cowan/XML/tagsoup/) instead? Otherwise [this could be you](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Dave Newton Oct 28 '11 at 16:08

1 Answer

Are you sure this has to do with "special iso-8859-1 characters" and not newlines? By default, `.` does not match line terminators. You can use the DOTALL flag to make it match line terminators as well, e.g.:

Pattern pattern = Pattern.compile("<p class=\"qt\">(.*?)</p>", Pattern.DOTALL);
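
A minimal sketch of one way this can happen, using only the JDK. It assumes (the question does not confirm this) that the page contains the byte 0x85, which decodes under ISO-8859-1 to U+0085 (NEXT LINE). Java's `Pattern` treats U+0085 as a line terminator, so `.` skips it unless DOTALL is set. The class name, helper names and sample markup are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {
    // Same expression as in the question.
    static final String QT_REGEX = "<p class=\"qt\">(.*?)</p>";

    // Hypothetical page body: the second paragraph contains byte 0x85,
    // decoded as ISO-8859-1, which yields the character U+0085.
    static String samplePage() {
        String nel = new String(new byte[] {(byte) 0x85}, StandardCharsets.ISO_8859_1);
        return "<p class=\"qt\">plain</p> <p class=\"qt\">wait" + nel + "</p>";
    }

    // Counts the matches of QT_REGEX in html, compiled with the given flags.
    static int countMatches(String html, int flags) {
        Matcher matcher = Pattern.compile(QT_REGEX, flags).matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String page = samplePage();
        System.out.println("default flags: " + countMatches(page, 0));
        System.out.println("DOTALL:        " + countMatches(page, Pattern.DOTALL));
    }
}
```

Without DOTALL only the first paragraph is found; with DOTALL both are. Stripping `\r` and `\n` beforehand does not help here, because U+0085 is a different character.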
Laurence Gonsalves
  • The newline characters \n and \r are replaced with spaces, as you can see in the first code snippet. Surprisingly, the flag you mentioned made the regex match those special characters, so it solved the problem. Thanks! – Sebastian Nowak Oct 28 '11 at 16:13
  • I actually hadn't noticed the `replaceAll` in the earlier line, but there are line terminators other than `\n` and `\r` (e.g. `\u0085`, `\u2028` and `\u2029`). I have frequently had bugs where `.` wasn't matching everything I wanted it to, and every time it was because of a missing `DOTALL`. – Laurence Gonsalves Oct 28 '11 at 16:20