I have written simple code to get the content-type of a given URL. To make the processing faster, I changed the request method to HEAD.

// Added a random puppy face picture here 
// On entering this query in browser (or Poster<mozilla> or Postman<chrome>), the
// content type is shown as image/jpeg

URL url = new URL("http://www.bubblews.com/assets/images/news/521013543_1385596410.jpg");    

HttpURLConnection connection = (HttpURLConnection) url
        .openConnection();
connection.setRequestMethod("HEAD");
connection.connect();
String contentType = connection.getContentType();
System.out.println(contentType);
if (!contentType.contains("text/html")) {
    System.out.println("NOT TEXT/HTML");
    // Do something
}

I want to take some action when the content-type is not text/html, but when I set the request method to HEAD, the content-type is reported as text/html. If I fire the same HEAD request using Poster or Postman, I see the content-type as image/jpeg.

So what makes the content-type change in the case of this Java code? Can someone please point out any mistake that I may have made?

Note: I used this post as reference

Some guy
  • I suppose you're getting an HTML page which says "method not allowed" or some other error. You should probably add an "Accept" header and a "User-Agent" header. – hgoebl Feb 13 '14 at 15:01
  • @hgoebl well, in that case, shouldn't it also not have given `image/jpeg` when tested using Poster/Postman? – Some guy Feb 13 '14 at 15:04
  • I'm not sure how many headers Postman adds to your request that are not explicitly set by you. I suppose 'User-Agent' and 'Accept' could be among them. Can you sniff the traffic (Fiddler, Wireshark)? – hgoebl Feb 13 '14 at 16:06
  • @hgoebl Thanks a lot, adding a User-Agent property to the request header solved the problem. Can you add that as an answer so that I can accept it? – Some guy Feb 14 '14 at 05:35

1 Answer

You should probably add an Accept header and/or User-Agent header.

Most web servers deliver different content depending on headers set by the client (e.g. web browser, Java HttpURLConnection, curl, ...). This is especially true for Accept, Accept-Encoding, Accept-Language, User-Agent, Cookie and Referer.

As an example, a web server might refuse to deliver an image if the Referer header does not point to an internal page. In your case, the web server apparently doesn't deliver images if the request looks like it comes from a robot. So if you make your request look as if it's coming from a web browser, the server might deliver it.
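A minimal sketch of the idea, not the actual behavior of the site in question: the class name `HeadProbe`, the `/puppy.jpg` path, the example User-Agent string, and the local stub server are all assumptions made for illustration. The stub imitates a server that answers `403` with `text/html` when the User-Agent looks like the default `Java/...` one, so the effect of setting the header can be seen without touching a real website.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

public class HeadProbe {

    /** Sends a HEAD request and returns "<status code> <content type>". */
    public static String probe(String address, String userAgent) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        connection.setRequestMethod("HEAD");
        if (userAgent != null) {
            // Without this, HttpURLConnection sends its default "Java/<version>" agent.
            connection.setRequestProperty("User-Agent", userAgent);
        }
        connection.setRequestProperty("Accept", "image/*,*/*;q=0.8");
        try {
            // Check the status first: an error page usually has its own content type.
            return connection.getResponseCode() + " " + connection.getContentType();
        } finally {
            connection.disconnect();
        }
    }

    /** Hypothetical stand-in for a server that blocks clients it takes for robots. */
    public static HttpServer startStubServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/puppy.jpg", exchange -> {
            String agent = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean looksLikeBot = agent == null || agent.startsWith("Java/");
            exchange.getResponseHeaders()
                    .set("Content-Type", looksLikeBot ? "text/html" : "image/jpeg");
            // A length of -1 means "no response body", which is what HEAD requires.
            exchange.sendResponseHeaders(looksLikeBot ? 403 : 200, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startStubServer();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/puppy.jpg";
            System.out.println(probe(url, null)); // default Java user agent
            System.out.println(probe(url, "Mozilla/5.0 (compatible; ExampleClient/1.0)"));
        } finally {
            server.stop(0);
        }
    }
}
```

Against a real server, only `probe` would be needed. Printing the status code next to the content type makes it obvious when the `text/html` belongs to an error page rather than to the requested resource.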

When crawling websites, you should respect robots.txt (because you are acting like a robot). So strictly speaking, you should be careful about faking the User-Agent when making a lot of requests or building a big business out of this. I don't know how big websites react to such behavior, especially when someone bypasses their business...

Please don't see this as a telling-off. I just wanted to point you to this, so you don't run into trouble. Maybe it's not a problem at all, YMMV.

hgoebl
  • I was just experimenting with the java.net.* package. Just out of curiosity, when I place a `HEAD` request, why would the web server even have to 'think' about delivering images? Isn't `HEAD` supposed to be only for headers? Or is it, as you say, to 'protect' its business? – Some guy Feb 14 '14 at 07:11
  • I think you're right. HEAD requests shouldn't do any harm. But most implementations of dynamic content don't have extra logic for HEAD requests; they just don't send the content. In practice, the `If-Modified-Since` header (and the like) is used more frequently than HEAD requests. BTW, it would be very interesting to know what the response looked like when you expected an image type and got text/html. – hgoebl Feb 14 '14 at 07:48
  • I did look into the complete header after your point that it might be blocked for `automated crawling`. In fact, the response was `403 forbidden`, and this message is actually `text/html`. It all makes sense now, thanks for pointing it out! – Some guy Feb 14 '14 at 09:15
  • Adding to that, the User-Agent when requested through the code was `User-Agent : Java/1.7.0_51` – Some guy Feb 14 '14 at 09:19