
In my application I need to parse a website and save some data from it to the database. I am using HttpClient to get the page content. My code looks like this:

        HttpClient client = new DefaultHttpClient();
        // Log the URL being fetched, for debugging.
        System.out.println(siteUrl + personUrl);
        HttpGet contentGet = new HttpGet(siteUrl + personUrl);
        HttpResponse response = client.execute(contentGet);

        String html = convertStreamToString(response.getEntity().getContent());

       /*
          parse the page
       */

    /***********************************************************************/

    public static String convertStreamToString(InputStream is) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        StringBuilder sb = new StringBuilder();
        String line = null;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        is.close();
        return sb.toString();
    }

I am doing this in a loop: I try to get the content of several pages (their structure is the same). Sometimes it works fine, but unfortunately, in many cases the response is a sequence of garbage like this:

�=�v7���9�Hdz$�d7/�$�st��؎I��X^�$A6t_D���!gr�����C^��k@��MQ�2�d�8�]

I don't know where the problem is; please help me.


I have displayed the headers of all the responses I got. For the correct ones, they are:

Server : nginx/1.0.13
Date : Sat, 23 Mar 2013 21:50:31 GMT
Content-Type : text/html; charset=utf-8
Transfer-Encoding : chunked
Connection : close
Vary : Accept-Encoding
Expires : Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control : no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma : no-cache
Set-Cookie : pfSC=1; path=/; domain=.profeo.pl
Set-Cookie : pfSCvp=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/; domain=.profeo.pl

For the incorrect ones:

Server : nginx/1.2.4
Date : Sat, 23 Mar 2013 21:50:33 GMT
Content-Type : text/html
Transfer-Encoding : chunked
Connection : close
Set-Cookie : pfSCvp=3cff2422fd8f9b6e57e858d3883f4eaf; path=/; domain=.profeo.pl
Content-Encoding : gzip

Any other suggestions? My guess is that the gzip encoding is the problem here, but what can I do about it?

user1315305

2 Answers


This probably has to do with some websites using a different character encoding in their response than your JVM default. To convert a raw byte stream, like the one an InputStream provides, into a character stream (or a String), you have to choose a character encoding. HTTP responses can use different encodings, but they typically declare the one they're using in the charset parameter of the response's Content-Type header. You could read that header from the HttpResponse and pick the charset manually, but since this is such a common need, your library provides a utility for it in the EntityUtils class. You can use it like so:

String html = EntityUtils.toString(response.getEntity());

You'll have to add

import org.apache.http.util.EntityUtils;

to the top of your file for that to work.
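For comparison, the manual approach looks roughly like the sketch below. This assumes HttpClient 4.2+, where the ContentType helper class is available in org.apache.http.entity; it is only an illustration, since EntityUtils.toString() already does the same thing internally:

    import java.nio.charset.Charset;
    import org.apache.http.HttpEntity;
    import org.apache.http.entity.ContentType;
    import org.apache.http.util.EntityUtils;

    HttpEntity entity = response.getEntity();
    // ContentType.getOrDefault() parses the Content-Type header and falls
    // back to a default content type when the header is absent.
    ContentType contentType = ContentType.getOrDefault(entity);
    Charset charset = contentType.getCharset();
    // The charset can still be null if the header declared no charset
    // parameter, so fall back to UTF-8 in that case.
    String html = EntityUtils.toString(entity,
            charset != null ? charset : Charset.forName("UTF-8"));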

If that doesn't help, another possibility is that some of the URLs you're retrieving are binary, not textual, in which case the things you're trying to do don't make sense. If that's the case, you can try to distinguish the textual responses from the binary ones by checking the Content-Type header, like so:

boolean isTextual = response.getFirstHeader("Content-Type").getValue().startsWith("text");
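Note that getFirstHeader() returns null when the response has no Content-Type header at all, so a slightly more defensive version of that check (just a sketch) would be:

    import org.apache.http.Header;

    // Guard against a missing Content-Type header to avoid a
    // NullPointerException on responses that omit it.
    Header contentTypeHeader = response.getFirstHeader("Content-Type");
    boolean isTextual = contentTypeHeader != null
            && contentTypeHeader.getValue().startsWith("text");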

Edit:

After looking at the HTTP headers you added to your question, my best guess is that this is being caused by gzip compression of the responses. You can find more info on how to deal with that in this question, but the short version is that you should try using ContentEncodingHttpClient instead of DefaultHttpClient.

Another edit: ContentEncodingHttpClient is now deprecated, and you're supposed to use DecompressingHttpClient instead.
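As a minimal sketch, assuming HttpClient 4.2+ (where both classes live in org.apache.http.impl.client), the change to your code would look like this:

    import org.apache.http.client.HttpClient;
    import org.apache.http.impl.client.DecompressingHttpClient;
    import org.apache.http.impl.client.DefaultHttpClient;

    // DecompressingHttpClient wraps another client, adds an Accept-Encoding
    // header to outgoing requests, and transparently decompresses gzipped
    // response bodies before you read them.
    HttpClient client = new DecompressingHttpClient(new DefaultHttpClient());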

gsteff
  • I used EntityUtils as you suggested and ran it for 20 similar pages. I also displayed the value of the isTextual variable. For 2 of the 20 pages the response was correct HTML; for the other 18 I received garbage once again. For all of them, isTextual was true. For example, the page http://profeo.pl/piotr-grzes was received successfully, and http://profeo.pl/annais wasn't. I have no idea what's wrong; these pages are practically identical. – user1315305 Mar 23 '13 at 21:00
  • Thank you so much! I spent so much time trying to figure it out, now it finally works! – user1315305 Mar 23 '13 at 22:49

You need an HttpClient that doesn't use compression. I use this one: HttpClientBuilder.create().disableContentCompression().build()
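For context, a complete minimal sketch (assuming HttpClient 4.3+, where HttpClientBuilder is available; the URL is one of the pages from the question):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.apache.http.util.EntityUtils;

    // disableContentCompression() stops the client from advertising
    // Accept-Encoding: gzip, so the server should send an uncompressed
    // response that can be read as plain text.
    CloseableHttpClient client = HttpClientBuilder.create()
            .disableContentCompression()
            .build();
    try (CloseableHttpResponse response =
            client.execute(new HttpGet("http://profeo.pl/annais"))) {
        String html = EntityUtils.toString(response.getEntity());
    }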

laaposto