I was pulling images from a URL when I ran into something I had never seen before. The header check returned a 403 error, and although the image extensions were listed as .jpg, the MIME type came back as application/octet-stream, and checking the content type returned text/html.

I have read that a 403 is "typically" used to prevent screen scraping, but this happens only on the images.

I found it odd that I could view the source of the web page, see the image src, click on it, and have the image load in the browser, but could not retrieve it via code.

Is there a way to convert the image URL into an actual image? I eventually want to pull the height, width, and size info from the images and save them to a folder on my server.

$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);

$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {

    $image_src = $tag->getAttribute('src');

    // get_headers() returns an array, so print it with print_r() rather than echo
    print_r(get_headers($image_src, 1)); //returns a 403 Forbidden Error

    echo image_type_to_mime_type(exif_imagetype($image_src)); //returns application/octet-stream

    $i = getimagesize($image_src);
    var_dump($i); //returns bool(false)

    // HEAD request via cURL; CURLOPT_NOBODY already switches the method to HEAD
    $c = curl_init();
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_HEADER, 1);
    curl_setopt($c, CURLOPT_NOBODY, true);
    curl_setopt($c, CURLOPT_URL, $image_src);
    curl_exec($c);
    echo $content_type = curl_getinfo($c, CURLINFO_CONTENT_TYPE); //returns text/html

}
Kurt Marshman

1 Answer

In my experience, getting application/octet-stream when you expect an image MIME type such as image/jpeg or image/png means the script could not process the image correctly, often because of an incorrect PHP config. (For example, an image bigger than the max file upload or post size gives a MIME type of octet-stream.)

To use file_get_contents() on a URL, you need to ensure that allow_url_fopen is enabled, so that fopen is allowed to get the contents of a URL as though it were a local file. (PHP INI allow_url_fopen)
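As a quick way to check that setting from code, here is a minimal sketch. The function name pickTransport() is hypothetical (it is not part of PHP); it only reports which of the two approaches discussed here is available on the current server.

```php
<?php
// Hypothetical helper: report which HTTP transport this PHP install can use.
// Returns 'stream' when allow_url_fopen is on, 'curl' when only the cURL
// extension is loaded, and null when neither is available.
function pickTransport(): ?string
{
    // ini_get() returns a string such as "1", "0", "" or "On" depending on
    // how the directive was set, so normalise it to a real boolean.
    $urlFopen = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);

    if ($urlFopen) {
        return 'stream';
    }
    if (extension_loaded('curl')) {
        return 'curl';
    }
    return null;
}
```

If this returns 'stream', file_get_contents() on a URL is at least permitted by the config, so a 403 would then have to come from the remote server rather than from PHP.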

Alternatively, look at using cURL to download the URL and go from there (look at this answer for a way of doing this). Try both the config change and the cURL approach to see if they yield the same results.
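A rough sketch of the cURL route, assuming the remote server is filtering on request headers such as User-Agent or Referer (that is a guess, not something confirmed for this site, and the request may still come back as 403). The names fetchImageBytes() and describeImage() and the User-Agent string are made up for illustration.

```php
<?php
// Hypothetical helper: download a URL with browser-like headers.
// Returns the raw body on HTTP 200, or null on any other outcome.
function fetchImageBytes(string $url, string $referer): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ImageFetcher/1.0)', // assumed value
        CURLOPT_REFERER        => $referer, // assumed: the page the image was found on
        CURLOPT_TIMEOUT        => 15,
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ($body !== false && $status === 200) ? $body : null;
}

// Hypothetical helper: read width, height, MIME type and byte size from
// image data already in memory. Returns null when the data is not an image
// (for example, an HTML error page).
function describeImage(string $bytes): ?array
{
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return null;
    }
    return [
        'width'  => $info[0],
        'height' => $info[1],
        'mime'   => $info['mime'],
        'bytes'  => strlen($bytes),
    ];
}
```

With these, something like `$bytes = fetchImageBytes($image_src, $url);` followed by `describeImage($bytes)` and `file_put_contents()` would cover the height, width, size, and save-to-folder steps from the question, provided the download itself succeeds.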

However, the fact that you are getting a 403 error suggests that something on the remote side is not allowing you to retrieve the images through your specific request. As you correctly identified, this could be a security measure to stop scraping. Have you tried grabbing images from a different website, or from a server that is under your control?
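It may help to check the status code before trusting anything else about the response. When the server answers 403, the body is most likely an HTML error page, which would explain the text/html content type, and image_type_to_mime_type() falls back to application/octet-stream when exif_imagetype() cannot identify the data. The sketch below pulls the final status code out of the array that get_headers() returns; the function name statusCodeFromHeaders() is hypothetical.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from the array
// returned by get_headers(). Status lines sit under integer keys; when
// redirects are followed there is one per hop, so the last one wins.
function statusCodeFromHeaders(array $headers): ?int
{
    $status = null;
    foreach ($headers as $key => $value) {
        if (is_int($key) && is_string($value)
            && preg_match('#^HTTP/\S+\s+(\d{3})#', $value, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}
```

A call such as `statusCodeFromHeaders(get_headers($image_src, 1))` can then gate the rest of the loop, so getimagesize() and friends only run when the result is 200.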

Hope something here helps :)

mrjpsycho
  • I have used the script hundreds of times on servers that I do and do not control, and this is the first time I have run into this issue. – Kurt Marshman Dec 01 '14 at 23:24
  • Hmm interesting... care to share the URL? what happens if you simply do a cURL to the image source, does that also respond with a 403? – mrjpsycho Dec 01 '14 at 23:29
  • Sorry @timclutton, I never saw your response. The URL of the site I was testing is http://www.windycitydubfest.com/ and the image is img src="http://www.windycitydubfest.com/wp-content/uploads/2014/11/DSC_0013-e1415224501159.jpg" – Kurt Marshman Dec 01 '14 at 23:33
  • I dunno about you, I get a 403 when just navigating to this image via chrome – mrjpsycho Dec 01 '14 at 23:35
  • The site is not mine so I do not have access to the server per se. The site is owned by a friend of mine. – Kurt Marshman Dec 01 '14 at 23:36
  • Strangely when I go direct to the image link you provided it throws me a 403 forbidden error. Then if I remove the file name and go to http://www.windycitydubfest.com/wp-content/uploads/2014/11/ I get an empty directory listing. If I then go up a level again and back in I get the list of images, and can load the image. – mrjpsycho Dec 01 '14 at 23:42
  • http://www.windycitydubfest.com/wp-content/uploads/2014/11/DSC_0013-e1415224501159.jpg – Kurt Marshman Dec 01 '14 at 23:42
  • Yeh I gathered it was a markdown issue adding a " to the end. I've not seen behaviour like this before. Trying it in incognito gives a 403, then going to uploads/2014, clicking on 11, then on the file allows me to see the image – mrjpsycho Dec 01 '14 at 23:45
  • I have never seen it either so I have no idea how to attempt a php or curl call to solve this. Everything I have tried has not returned the result I am looking for. – Kurt Marshman Dec 01 '14 at 23:48
  • Unfortunately it seems it's a server-side issue, something to do with the way it serves directory listings and files for wp-content, only allowing a specific way to access files. – mrjpsycho Dec 02 '14 at 00:08