1

I have made a simple web crawler with PHP cURL that should grab all the images from the Amazon search results page for the keyword samsung.

Here is the code:

$curl = curl_init(); // $curl is going to be data type curl resource

$search_string = "samsung";

$url = "https://www.amazon.com/s?k$search_string";

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // ssl
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // storing in variable 

$result = curl_exec($curl);

preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);

print_r($matches);

curl_close($curl);

But I get an empty array:

Array ( [0] => Array ( ) )

I don't know why it shows that. If you know what is going wrong or how I can handle this, please let me know. I would really appreciate any ideas.

Thanks in advance.

Note that I used the [^\s]*? regular expression in place of the image name so that it matches all the available images on the web page.

UPDATE #1:

Results of curl --head https://www.amazon.com/s?k=samsung

HTTP/1.1 503 Service Unavailable
Content-Type: text/html
Content-Length: 2671
Connection: keep-alive
Server: Server
Date: Tue, 15 Jun 2021 20:59:38 GMT
x-amz-rid: 9BVX8KQMWJ4QDJ75ETYV
Vary: Content-Type,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
Last-Modified: Fri, 14 May 2021 19:08:48 GMT
ETag: "a6f-5c24ef9383000"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
Permissions-Policy: interest-cohort=()
X-Cache: Error from cloudfront
Via: 1.1 5345148f0ba8ae3c67b69d035acdbfc5.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: AMS50-C1
X-Amz-Cf-Id: AHdq2-QLEtCE4WvXZIEh_P75D8hCrHP09EAkNqBer5VBS-pI-blj1w==
aynber
  • Most likely the response to your request is a redirect or rewrite, and does not include what you're looking for. – Don R Jun 15 '21 at 20:40
  • @DonR So what should I do to fix that –  Jun 15 '21 at 20:42
  • You'd have to process the redirect and request the new resource, as a browser would. – Don R Jun 15 '21 at 20:43
  • @DonR Can you explain more or give me a link to an example please –  Jun 15 '21 at 20:44
  • step 1: open a terminal/cmd and just _run curl_ to see what it gives back for your URL. But you probably want to fix that typo in the URL first (stick in an `echo $url;` to see what's wrong with it). – Mike 'Pomax' Kamermans Jun 15 '21 at 20:47
  • @Mike'Pomax'Kamermans I did that, and put an **UPDATE #1** on results at the post –  Jun 15 '21 at 21:01
  • The `503 Service Unavailable` may indicate that your IP has been blacklisted by Amazon for scraping. This blacklisting is often done for requests without an up-to-date User-Agent – MaartenDev Jun 15 '21 at 21:03
  • @MaartenDev I don't get any error, just a *NULL* result ! –  Jun 15 '21 at 21:07
  • Except you do, as per the headers that "real" curl shows: `HTTP/1.1 503 Service Unavailable`. Also, don't turn off ssl verification, keep it on, it should work just fine if you're trying to access a normal URL. – Mike 'Pomax' Kamermans Jun 15 '21 at 21:46

4 Answers

3

First issue: Your code:

$url = "https://www.amazon.com/s?k$search_string";

should be (note the "=")

$url = "https://www.amazon.com/s?k=$search_string";
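One way to avoid dropping the "=" by hand is to let PHP assemble the query string. The following is a minimal sketch, not part of the original answer; the helper name amazon_search_url is made up for illustration.

```php
<?php
// Hypothetical helper: build the search URL from its parts instead of
// hand-concatenating the query string.
function amazon_search_url(string $keyword): string
{
    // http_build_query() inserts the "=" and URL-encodes the value.
    return 'https://www.amazon.com/s?' . http_build_query(['k' => $keyword]);
}

echo amazon_search_url('samsung'), "\n";        // https://www.amazon.com/s?k=samsung
echo amazon_search_url('samsung galaxy'), "\n"; // https://www.amazon.com/s?k=samsung+galaxy
```

This also takes care of keywords that contain spaces or other characters that are not safe in a URL.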

Second issue: Amazon is smart, and they are not going to let you scrape as you please. What comes back is not the search results page but the content for this page:

[image: screenshot of the page Amazon returned]

You can see this with:

$result = curl_exec($curl);
var_dump($result);
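Besides dumping the body, it may help to look at the HTTP status code, since the curl --head output in the question shows a 503. This is a rough sketch under that assumption; fetch_with_status is a hypothetical name, and the 10-second connect timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: fetch a URL and return [HTTP status, body, cURL error text].
function fetch_with_status(string $url): array
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);   // arbitrary timeout, in seconds
    $body   = curl_exec($curl);                       // false if the transfer failed
    $error  = curl_error($curl);                      // empty string when there was no error
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE); // 0 if no response was received
    return [$status, $body, $error];
}

// Usage (performs a real request, so the result depends on what the server sends back):
// [$status, $body, $error] = fetch_with_status('https://www.amazon.com/s?k=samsung');
// if ($body === false)  { echo "cURL error: $error\n"; }
// elseif ($status !== 200) { echo "Unexpected HTTP status: $status\n"; }
```

A status other than 200 is a hint that the body is an error or block page rather than search results, so an empty regex match is expected.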

Third issue: the regex does not match. It is worth testing a regex at https://www.phpliveregex.com/#tab-preg-match-all against the real page source (right-click > View Source, then copy and paste the page content).

When I tried it, your regex did not return any results, but this one did: https://m.media-amazon.com/images/I/[^\s]*?.jpg

It may be that the ._AC_UL320_ part of the string is also an Amazon anti-scraping thing... :(
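To illustrate the point about the suffix, here is a small sketch that runs both patterns against sample markup. The sample URLs are copied from the output shown in another answer on this page, where the size suffix is _AC_UY218_ rather than _AC_UL320_. The looser pattern is my own variation with the dots escaped, not the exact one quoted above.

```php
<?php
// Sample markup; the image URLs are taken from output posted elsewhere on this page.
$html = '<img src="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg">'
      . ' <img src="https://m.media-amazon.com/images/I/81AT+Flc+EL._AC_UY218_.jpg">';

// The pattern from the question: it only matches a hard-coded _AC_UL320_ suffix.
$strict = '!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!';

// A looser variant (an assumption, adjust as needed): any suffix, literal dots
// escaped, and the match stops at whitespace or a quote.
$loose = '~https://m\.media-amazon\.com/images/I/[^\s"\']+?\.jpg~';

$strictCount = preg_match_all($strict, $html, $strictMatches);
$looseCount  = preg_match_all($loose, $html, $looseMatches);

echo "strict: $strictCount, loose: $looseCount\n"; // strict: 0, loose: 2
print_r($looseMatches[0]);
```

If the suffix differs between requests, hard-coding it in the pattern will give an empty result even when the page itself loads correctly.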

  • You can scrape AWS provided you've simulated a real user well enough. I'd assume OP's UserAgent is still `cURL/PHP` or similar which would obviously be red-flagged. Guzzle with a CookieJar helps a lot as it retains cookies previously set where their absence would trigger a similar captcha – zanderwar Jun 16 '21 at 00:22
  • you trigger the anti-scraping thing by not having a browser-like `Accept` header, and by not having a User-Agent, interestingly curl/x.x.x is explicitly a BLACKLISTED user-agent, but "libcurl" (a non-standard ua string) is not blacklisted, you can get a proper result by running ```curl 'https://www.amazon.com/s?k=samsung' --compressed -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' --user-agent 'libcurl'``` – hanshenrik Jun 21 '21 at 05:32
0

The URL is not https://www.amazon.com/s?k$search_string; it is supposed to be 'https://www.amazon.com/s?k='.urlencode($search_string);. Amazon.com also requires you to send an Accept-Encoding header, otherwise you risk getting a gzip-compressed response with nothing to decompress it, which means you need CURLOPT_ENCODING. Amazon will block you if you don't supply a User-Agent header, so you must supply CURLOPT_USERAGENT. Amazon will also block you without a browser-like Accept header, so you need CURLOPT_HTTPHEADER => array('accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng')

Also, do not parse HTML with regex. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. Instead, use an HTML parser like DOMDocument.

this code

<?php
$curl = curl_init(); // $curl is going to be data type curl resource

$search_string = "samsung";

$url = "https://www.amazon.com/s?k=".urlencode($search_string);

// SSL verification is left at its default (enabled), as suggested in the comments on the question.
curl_setopt_array($curl, array(
    CURLOPT_URL            => $url,
    CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
    CURLOPT_ENCODING       => '',        // send Accept-Encoding and decode the response automatically
    CURLOPT_USERAGENT      => 'libcurl',
    CURLOPT_HTTPHEADER     => array(
        'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    ),
));
$html = curl_exec($curl);
$domd = new DOMDocument();
@$domd->loadHTML($html);
foreach($domd->getElementsByTagName("img") as $img){
    echo $img->getAttribute("src"),"\n";
}

outputs

//fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:136-7756522-9160852:777GSTVR1XJ9MBF1N0KN$uedata=s:%2Frd%2Fuedata%3Fstaticb%26id%3D777GSTVR1XJ9MBF1N0KN:0
https://images-na.ssl-images-amazon.com/images/G/01/gno/sprites/nav-sprite-global-1x-hm-dsk-reorg._CB405937547_.png
https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/91eAcgt9fSL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81afsli5ctL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61m1Dot5KCL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61HFJwSDQ4L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png
https://m.media-amazon.com/images/I/21OXy0oJ8VL._SS160_.png
https://m.media-amazon.com/images/I/61jfI8GyQgL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61LUNEgB6iL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/813dec-cszS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81AT+Flc+EL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png
https://m.media-amazon.com/images/I/21OXy0oJ8VL._SS160_.png
https://m.media-amazon.com/images/I/61a5ejk6K2L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81+3SWSAhDL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61pwE8H34zL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71ejkOW4y2L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71G6eW8H8hL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/91dFUw5MUTS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81P4RzFnw6L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/712iry8nIYL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61VgW9ZZXiL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61ft-L7HnUL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51icdppvRVL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/6164p9jY2jS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51skvShlcsL._AC_UY218_.jpg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/93913ead-ae42-4933-8fc4-e9f88b0396c9/1635f47b-1fa9-40ca-8d85-47f529c1ba8b/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/6aa489c6-af9d-48d0-94c8-cce1a4f50fc7/ff2a7805-3166-41b9-9881-d00901ca9dfd/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/73b89b9f-ee28-446f-8535-beacd328c95a/8caa5478-3583-49f9-9dcb-6e5b0a254fa6/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/457fd8ad-f566-4682-bb66-fd865954aec0/fb2cdc76-7ed6-4b86-9196-d40c3ead2914/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/5c60fcd5-17c1-4389-8423-2252436f21c8/0125e72d-9178-4048-bea3-9d268a406a05/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/f852e5ab-0fa9-4f91-b195-b0facc4d0d70/30b0ec08-79b2-428d-98df-aadffd2c00eb/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/d173de56-5162-463f-be97-d256c1895024/7974c773-0c53-43a1-bfb4-91d7cc3ce801/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/2cfe5e10-6a7e-43f4-80c7-d87f212b8007/43e8a030-58c5-491a-9854-cd4d8824a873/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/personalization/ybh/loading-4x-gray._CB485916920_.gif
https://assoc-na.associates-amazon.com/abid/um?s=136-7756522-9160852&m=ATVPDKIKX0DER
//fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:136-7756522-9160852:777GSTVR1XJ9MBF1N0KN$uedata=s:%2Frd%2Fuedata%3Fnoscript%26id%3D777GSTVR1XJ9MBF1N0KN:0
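The list above mixes sprites, tracking pixels and repeated entries with the images the question is after. A possible way to narrow it down is to filter on the URL prefix that the question's regex targets, using DOMXPath. This is only a sketch on hard-coded sample markup (URLs copied from the output above); treating everything under /images/I/ as a wanted image is an assumption.

```php
<?php
// Sample markup standing in for the fetched page.
$html = <<<'HTML'
<html><body>
<img src="https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif">
<img src="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg">
<img src="https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png">
<img src="https://m.media-amazon.com/images/I/91eAcgt9fSL._AC_UY218_.jpg">
<img src="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg">
</body></html>
HTML;

$domd = new DOMDocument();
@$domd->loadHTML($html); // "@" silences warnings about imperfect markup
$xpath = new DOMXPath($domd);

// Keep only <img> elements whose src starts with the prefix from the question.
$prefix = 'https://m.media-amazon.com/images/I/';
$urls = [];
foreach ($xpath->query("//img[starts-with(@src, '$prefix')]") as $img) {
    $urls[] = $img->getAttribute('src');
}

// Drop duplicates and reindex from 0.
$urls = array_values(array_unique($urls));
print_r($urls);
```

With the sample markup this keeps three distinct URLs and drops the grey-pixel placeholder and the repeated entry.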
hanshenrik
0

Yes, your URL is wrong: $url = "https://www.amazon.com/s?k$search_string"; is missing the "=". The actual URL is below, so you can try:

$url = "https://www.amazon.com/s?k=$search_string";

0

Firstly, there is a typo. Change

$url = "https://www.amazon.com/s?k".$search_string;

to

$url = "https://www.amazon.com/s?k=".$search_string;

Amazon expects some header values to be present when you request content. Please refer to the following cURL request:

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
    'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
));
curl_setopt($curl, CURLOPT_ENCODING, '');

$result = curl_exec($curl);

Lastly, change your preg_match_all call from

 preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);

To

 preg_match_all('/(https?:\/\/\S+\.(?:jpg|png|gif))\s+/', $result, $matches);

Complete Code :

<?php

$curl = curl_init();
$search_string = "samsung";

$url = "https://www.amazon.com/s?k=".$search_string;

// Set headers to match what a browser sends to Amazon. You can check them with any browser's developer tools.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); 
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
        'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
));
curl_setopt($curl, CURLOPT_ENCODING, ''); 

$result=curl_exec($curl);

preg_match_all('/(https?:\/\/\S+\.(?:jpg|png|gif))\s+/', $result, $matches);

print_r($matches);
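One thing to be aware of with the pattern in the preg_match_all call above: the trailing \s+ means a URL only matches when it is followed by whitespace. The sketch below shows this on made-up sample markup (the _AC_UY436_ URL and the srcset attribute are assumptions for illustration), next to a variant of my own that stops at quotes instead.

```php
<?php
// Hypothetical sample: one URL in src (followed by a quote) and two in srcset
// (each followed by a space).
$html = '<img src="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg"'
      . ' srcset="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg 1x,'
      . ' https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY436_.jpg 2x">';

// Pattern from the answer: requires whitespace right after the extension.
$withSpace = '/(https?:\/\/\S+\.(?:jpg|png|gif))\s+/';

// Variant (an assumption): no trailing whitespace needed, stops at quotes and angle brackets.
$anyEnd = '~https?://[^\s"\'<>]+\.(?:jpg|png|gif)~';

$a = preg_match_all($withSpace, $html, $m1);
$b = preg_match_all($anyEnd, $html, $m2);

echo "with trailing whitespace: $a, without: $b\n"; // with trailing whitespace: 2, without: 3
```

Here the src URL is skipped by the first pattern because it is followed by a quote, so which pattern fits depends on where the wanted URLs sit in the real page source.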
Rohit Yadav