I want to use cURL to scrape multiple pages of an online shop. The problem is that the URLs are SEO-friendly (or something like that) and look like this:

https://shopname.com/product-id-title-of-a-product.html

If I use the entire URL it works and I'm able to get the data I'm looking for, but the only part of the URL that I know is the ID:

https://shopname.com/product-294

Is there a way to scrape that URL in this case?

The URL that only has the ID in it does redirect to the full URL.

This is the code that I'm using:

$curl = curl_init();
$url = 'https://shopname.com/product-294';

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$result = curl_exec($curl);
CJ Dennis
emma
  • Possibly no. .. – Sujit Agarwal Aug 10 '18 at 10:53
  • If that shop system does not deliver the content under that URL, and does not redirect to the full URL either … then no. Is this your shop? – CBroe Aug 10 '18 at 10:56
  • Hey CBroe, the url does redirect me to the full url and no, and yes this is mine but i want to learn curl not query a db :D X_X – emma Aug 10 '18 at 10:57
  • Hey Philipp, my problem is that if I insert the full URL it works, but with only the ID in it it doesn't. At the same time, if I put the ID-only URL in my browser and press enter, it does redirect me X_X – emma Aug 10 '18 at 11:03
  • @emma What does `curl_getinfo($curl, CURLINFO_RESPONSE_CODE);` return, if you execute it after you executed your curl call? 302? If so, what does `curl_getinfo($curl, CURLINFO_REDIRECT_URL);` return afterwards? – Philipp Maurer Aug 10 '18 at 11:07
  • @PhilippMaurer `curl_getinfo($curl, CURLINFO_RESPONSE_CODE);` returns `301` and `curl_getinfo($curl, CURLINFO_REDIRECT_URL);` OMG it returns the actual url :O!!!!! THANK YOUUUUUU! <3 :D But now how is it ok to reexecute that curl method? – emma Aug 10 '18 at 11:13
  • @emma It works as stated in the answer I just posted – Philipp Maurer Aug 10 '18 at 11:16

2 Answers

Curl provides the option CURLOPT_FOLLOWLOCATION.

curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

The documentation states:

TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).

It is therefore advisable to set CURLOPT_MAXREDIRS as well, for example to limit execution to one redirect:

curl_setopt($curl, CURLOPT_MAXREDIRS, 1);

With both options set, cURL should automatically follow the redirect to the full URL without any further code.
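
Combined with the code from the question, a minimal sketch could look like the following. The function names `redirectFollowingOptions` and `fetchFollowingRedirects` are hypothetical helpers made up for illustration, and the shape of the returned array is an assumption; only the cURL options and the `curl_getinfo` constants come from PHP's cURL extension.

```php
<?php
// Hypothetical helper: builds the cURL options discussed above.
function redirectFollowingOptions(string $url, int $maxRedirects = 1): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow "Location:" headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many redirects
    ];
}

// Hypothetical helper: runs the request and reports where it ended up.
// If the server redirects more often than $maxRedirects allows,
// curl_exec() returns false and 'error' holds the cURL error message.
function fetchFollowingRedirects(string $url, int $maxRedirects = 1): array
{
    $curl = curl_init();
    curl_setopt_array($curl, redirectFollowingOptions($url, $maxRedirects));

    $body = curl_exec($curl);

    return [
        'body'      => $body === false ? null : $body,
        'status'    => curl_getinfo($curl, CURLINFO_RESPONSE_CODE),
        'final_url' => curl_getinfo($curl, CURLINFO_EFFECTIVE_URL),
        'error'     => $body === false ? curl_error($curl) : null,
    ];
}
```

Calling `fetchFollowingRedirects('https://shopname.com/product-294')` should then return the body of the full product page, with `final_url` showing where the redirect led, assuming the shop answers with a single redirect.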

Philipp Maurer
  • There are known problems with `CURLOPT_FOLLOWLOCATION`, whose solutions are documented in [this Stack Overflow question](https://stackoverflow.com/q/2511410/8913537). – Philipp Maurer Aug 10 '18 at 11:22
  • Now it works, thank you so much for your time! I have only one more question: is it OK to use curl in a foreach if I have an array with 10 product IDs that I want to scrape? :-s – emma Aug 10 '18 at 11:23
  • @emma [This Stack Overflow answer](https://stackoverflow.com/a/18047230/8913537) should answer your question. – Philipp Maurer Aug 10 '18 at 11:36
  • <3 Thank you again and again and again! :D – emma Aug 10 '18 at 12:22
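
On the follow-up about using cURL inside a `foreach`: a minimal sketch, assuming the ID-only URL pattern from the question and sequential requests on one reused handle. The names `productUrl` and `scrapeProducts` are hypothetical, and this is not necessarily the approach taken in the linked answer.

```php
<?php
// Hypothetical helper: builds the ID-only URL pattern shown in the question.
function productUrl(int $id): string
{
    return 'https://shopname.com/product-' . $id;
}

// Fetches each product page in turn, reusing a single cURL handle.
// Returns an array keyed by product ID; a failed request maps to null.
function scrapeProducts(array $ids): array
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect to the full URL
        CURLOPT_MAXREDIRS      => 1,
    ]);

    $pages = [];
    foreach ($ids as $id) {
        // Only the URL changes between iterations; the other options persist.
        curl_setopt($curl, CURLOPT_URL, productUrl($id));
        $body = curl_exec($curl);
        $pages[$id] = ($body === false) ? null : $body;
    }

    return $pages;
}
```

For example, `scrapeProducts([294, 295])` would request both product pages one after the other and return their HTML keyed by ID.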

I think you need to capture the response headers of the first cURL request. They should contain the redirect URL, which you can parse out and use for a second cURL request to fetch the page you are after. An app like Postman or Insomnia can help you inspect the headers during this process.
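
A rough sketch of that two-step approach, assuming the server sends an absolute URL in its `Location` header (a relative one would need to be resolved against the original URL first). The function names `extractLocation` and `fetchViaManualRedirect` are hypothetical. As the comments under the question show, `curl_getinfo($curl, CURLINFO_REDIRECT_URL)` returns the redirect target directly, so parsing the header by hand is optional.

```php
<?php
// Hypothetical helper: pulls the Location value out of a raw header block.
// Returns null when the headers contain no Location line.
function extractLocation(string $rawHeaders): ?string
{
    if (preg_match('/^Location:\s*(.+?)\s*$/mi', $rawHeaders, $matches)) {
        return $matches[1];
    }
    return null;
}

// Hypothetical helper: requests $url without following redirects, then
// requests the redirect target itself. Returns the page body, or null on error.
function fetchViaManualRedirect(string $url): ?string
{
    // First request: include the response headers in the returned string.
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    $response = curl_exec($curl);
    if ($response === false) {
        return null;
    }

    // Split the header block from the body using the size cURL reports.
    $headerSize = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    $location   = extractLocation(substr($response, 0, $headerSize));
    if ($location === null) {
        // No redirect: the first response already holds the page body.
        return substr($response, $headerSize);
    }

    // Second request: fetch the URL named in the Location header.
    $curl = curl_init($location);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($curl);

    return $body === false ? null : $body;
}
```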

noid
  • Hey noid, isn't there a way to write that code myself? #-o I mean, could you please tell me in a little more detail what I should look for? X_X – emma Aug 10 '18 at 11:05
  • I have only done it in SOAP before, not cURL, so you would need something like this to get you started: https://stackoverflow.com/questions/9183178/can-php-curl-retrieve-response-headers-and-body-in-a-single-request – noid Aug 10 '18 at 11:10
  • Philipp Maurer has a much better answer than I have. – noid Aug 10 '18 at 11:21
  • Thank you for taking the time to answer my question! Thanks a lot :D – emma Aug 10 '18 at 11:25
  • Having experienced the same thing myself with soap, I was sure that I could offer that experience as a possible solution, turns out Curl has it built in. :) – noid Aug 10 '18 at 11:30