I have an application that downloads favicon files. Recently I switched from using file_get_contents to curlExec because it has a higher success rate.

However, when I try to download from www.prisonexp.org, the server sends the text "Forbidden" instead of the actual file data. Normally, I would see a bunch of binary data rendered as ASCII in this field of the test script.

I find this strange because I can just browse to the file in the browser and download it manually.

Is this valid? Or am I missing something? How are they preventing a download one way but not the other? To see the test script in action go here.

Test Script

As an aside, how can I detect when a server sends a text message instead of binary data? I could just check for "Forbidden", but I'm not sure whether this is a standard response.
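
One way to approach the detection question is to look at the HTTP status code and the first bytes of the body rather than matching the word "Forbidden". The status code 403 is standardized, while the body text a server sends along with it is not. The sketch below is in Python for brevity, and the same checks carry over to PHP. The function name `looks_like_icon` and the list of signatures are illustrative assumptions, not a complete catalogue of favicon formats.

```python
# Magic-number prefixes for common binary favicon formats (illustrative list).
ICON_SIGNATURES = (
    b"\x00\x00\x01\x00",    # ICO
    b"\x89PNG\r\n\x1a\n",   # PNG
    b"GIF87a",              # GIF (1987 variant)
    b"GIF89a",              # GIF (1989 variant)
    b"\xff\xd8\xff",        # JPEG
)


def looks_like_icon(status: int, content_type: str, body: bytes) -> bool:
    """Return True only if the response plausibly carries binary image data.

    Hypothetical helper: SVG favicons are text-based and would be rejected here.
    """
    # Anything other than 200 (for example 403 Forbidden) is not a file download.
    if status != 200:
        return False
    # A text/* content type signals an error page or message, not an icon.
    if content_type.lower().startswith("text/"):
        return False
    # Finally, confirm that the payload starts with a known image signature.
    return body.startswith(ICON_SIGNATURES)
```

With this check, a response whose body is the plain text "Forbidden" is rejected regardless of what wording the server chose.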

Research / Update

Download Methods

cade galt
  • They probably check the `User-Agent` header and blacklist a known set of user agents, such as wget and curl. – Dark Falcon Jun 23 '15 at 14:04
  • I am able to use `wget http://www.prisonexp.org/favicon.ico` to download the file successfully, so the error may well be in your code. Please post your actual code so that we can have a look at it. – DOOManiac Jun 23 '15 at 14:19
  • I've run this script successfully hundreds of times, and this is the first site/domain I've found that does this. Strange practice. – cade galt Jun 23 '15 at 14:20
  • On a side note, not every site uses `favicon.ico` as their Favicon! The favicon can be manually specified on the page, so if you're going for 100% accuracy you may need to parse the actual HTML as well! http://stackoverflow.com/questions/5691582/what-if-the-favicon-is-not-named-favicon-ico – DOOManiac Jun 23 '15 at 14:22
  • Can anyone hazard a guess as to why one way has more access than the other? I mean, why was that decision made by the architects? I'm going to use `wget` and fetch the file programmatically anyway. – cade galt Jul 07 '15 at 12:09
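
The comment above about favicons declared in the page itself can be sketched as follows: scan the HTML for `<link>` tags whose `rel` contains `icon`, and fall back to `/favicon.ico` when none is found. This is a minimal Python sketch using only the standard library. The names `IconLinkParser` and `find_favicon_urls` are hypothetical, and variants such as `apple-touch-icon` are deliberately ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collect href values of <link> tags whose rel attribute includes "icon"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several tokens, for example "shortcut icon".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_favicon_urls(page_url: str, html: str) -> list:
    """Return absolute favicon URLs declared in the page, or the default path."""
    parser = IconLinkParser()
    parser.feed(html)
    urls = [urljoin(page_url, href) for href in parser.hrefs]
    # No declaration found: fall back to the conventional location.
    return urls or [urljoin(page_url, "/favicon.ico")]
```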

1 Answer

Requests that can be made programmatically have to be throttled, or servers could become overwhelmed.

Reading the user-agent is one way this is done.
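
If the server does filter on the user agent, a client can test that theory by sending a browser-like `User-Agent` header and checking whether the response changes. Whether this particular site does so is only a guess from the comments. Below is a minimal Python sketch using the standard library; in PHP the equivalent setting is the `CURLOPT_USERAGENT` option. The names `build_request`, `fetch`, and `BROWSER_UA`, and the header value itself, are illustrative assumptions.

```python
import urllib.request

# Hypothetical browser-like value; any real browser string would serve.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Create a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download the resource; urlopen raises HTTPError on a 403 response."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Sites that block by user agent usually do so to limit automated traffic, so a client that overrides the header should still keep its request rate low.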

Wget might have more access because it is commonly used as a command-line tool. Similarly, images that are delivered to your client, such as in a browser, are directly accessible to you.