
I need to get the final redirect URL for this address: https://web.archive.org/web/20070701005218/http://www.maladnews.com/ which actually redirects to this: https://web.archive.org/web/20080109064420/http://www.maladnews.com/Site%203/Malad%20City%20&%20Oneida%20County%20News/Malad%20City%20&%20Oneida%20County%20News.html

I tried the approaches from other Stack Overflow answers, which work for other websites but not for the link above.

I've tried extracting the regular Location header:

if(preg_match('#Location: (.*)#', $html, $m))
 $l = trim($m[1]);

and also checked for a JavaScript redirect:

$l = preg_match("/window\.location\.replace\('(.*?)'\)/", $html, $m) ? $m[1] : null;

Please help!

Acidon

1 Answer

Use curl_getinfo() with CURLINFO_REDIRECT_URL or CURLINFO_EFFECTIVE_URL depending on your use case.

CURLINFO_REDIRECT_URL - With the CURLOPT_FOLLOWLOCATION option disabled: redirect URL found in the last transaction, that should be requested manually next. With the CURLOPT_FOLLOWLOCATION option enabled: this is empty. The redirect URL in this case is available in CURLINFO_EFFECTIVE_URL

-- http://php.net/manual/en/function.curl-getinfo.php

Example:

<?php
$url = 'https://google.com/';

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

$html = curl_exec($ch);

$redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

curl_close($ch);

echo "Original URL:   " . $url . "\n";
echo "Redirected URL: " . $redirectedUrl . "\n";

When I run this code, the output is:

Original URL:   https://google.com/
Redirected URL: https://www.google.com/
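
The quoted documentation also describes the other mode: with CURLOPT_FOLLOWLOCATION disabled, CURLINFO_REDIRECT_URL holds the next URL to request manually. A minimal sketch of following the chain by hand, which lets you see every hop, might look like the following. The function name `resolveRedirectChain` and the `$maxHops` limit are my own illustrative choices, not part of the curl API, and this still only covers HTTP-level redirects.

```php
<?php
// Hypothetical helper (name and $maxHops are illustrative assumptions).
// Follows HTTP redirects one hop at a time and returns every URL visited,
// starting with the original one.
function resolveRedirectChain(string $url, int $maxHops = 10): array
{
    $chain = [$url];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Disabled on purpose, so that CURLINFO_REDIRECT_URL is populated.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

    for ($hop = 0; $hop < $maxHops; $hop++) {
        curl_setopt($ch, CURLOPT_URL, $url);

        // Stop on transport errors (DNS failure, refused connection, ...).
        if (curl_exec($ch) === false) {
            break;
        }

        // Empty (or false) when the last response was not a redirect.
        $next = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
        if (!$next) {
            break;
        }

        $url = $next;
        $chain[] = $url;
    }

    // The handle is released when $ch goes out of scope.
    return $chain;
}
```

Calling `resolveRedirectChain('https://google.com/')` should return the original URL followed by each redirect target; the last element is the final URL. If the array contains only the original URL, the server did not send an HTTP redirect for it.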
Asaph
  • I tried that before, and just tried it again [here](http://ideone.com/p7bQSM), but for some reason it outputs the initial URL... Can you please modify the code to get it working? – Acidon Jan 27 '16 at 20:43
  • There are 2 things wrong with the code you linked. 1) The url you are testing with is invalid. (You can change it to `https://google.com/`, which redirects to `https://www.google.com/`.) 2) The call to `curl_getinfo($ch,CURLINFO_EFFECTIVE_URL)` needs to be moved immediately _after_ the call to `curl_exec($ch)` because at the point you're calling it, the redirect hasn't been followed yet. – Asaph Jan 27 '16 at 21:52
  • Thanks for pointing that out. However, the URL I am testing with is valid, since it's the URL I am running curl on (it's also mentioned in the question description). I've also experimented with different placements of the getinfo call, but it still returns the same result... It looks like the redirect is done via JavaScript or a meta refresh tag a few seconds after the initial page loads, and curl doesn't follow it. Please check the links provided in the question. – Acidon Jan 27 '16 at 22:28
  • The invalid url I was referring to is the one [here](http://ideone.com/p7bQSM), which is definitely not right. It has a bunch of dot, dot, dots in the domain name. Also, curl doesn't parse any javascript. You'll only be able to use curl to follow HTTP level redirects. I will update my code with a working curl only example to illustrate the HTTP redirect functionality. – Asaph Jan 27 '16 at 22:33
  • I see. Actually, `https://w...content-available-to-author-only...e.org/web/20070701005218/http://w...content-available-to-author-only...s.com/` is just the way that site (http://ideone.com/) masks URLs in PHP scripts. The real URL is only available to the post author, but the masking doesn't affect how the script runs (you can rerun it by forking/editing it). That site is just a PHP sandbox. If your example cannot work for the URL from the question description, I cannot accept it. As the question states, I've tried different solutions before, including the one you provided, but I am looking for a solution for that particular URL. – Acidon Jan 27 '16 at 22:43
  • OK, never mind ideone.com and its mangled URLs. You would probably have gotten a quicker answer on Stack Overflow by including sample code directly in your question here, instead of burying a link to ideone.com in a comment. In any case, I have included a working example in my answer. Please run that and verify that it works. – Asaph Jan 27 '16 at 22:50
  • The url from this question, namely `https://web.archive.org/web/20070701005218/http://www.maladnews.com/`, doesn't do an HTTP redirect at all. So curl is not the right tool for the job. If it does a javascript `window.location` assignment, you can try to parse that out of the markup, but this would be a fragile solution. – Asaph Jan 27 '16 at 22:54
  • This worked 100% for us! We get media from a company and the URLs they give us are always wrong, so we needed to get the right URLs... – Scott Aug 25 '20 at 20:05
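
For redirects that happen in the markup rather than at the HTTP level, as discussed in the comments above, one option is to fetch the page with curl and then search the HTML for a meta refresh tag or a `location` assignment. The sketch below is a rough illustration under several assumptions: the helper name `findClientSideRedirect` is made up, the regular expressions only cover the most common spellings (for the meta tag, `http-equiv` appearing before `content`), and I have not verified which mechanism the archived page in the question actually uses. As noted in the comments, this kind of parsing is fragile.

```php
<?php
// Hypothetical helper: looks for a client-side redirect target in HTML markup.
// Returns the target exactly as written in the page (possibly a relative URL),
// or null when none of the patterns match.
function findClientSideRedirect(string $html): ?string
{
    // e.g. <meta http-equiv="refresh" content="5; url=http://example.com/">
    $meta = '#<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]+'
          . 'content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*["\']?([^"\'>]+)#i';
    if (preg_match($meta, $html, $m)) {
        // Attribute values may contain entities such as &amp;
        return trim(html_entity_decode($m[1]));
    }

    // e.g. window.location = "..."; or location.href = '...';
    $assign = '#\blocation(?:\.href)?\s*=\s*["\']([^"\']+)["\']#i';
    if (preg_match($assign, $html, $m)) {
        return trim($m[1]);
    }

    // e.g. window.location.replace('...'); or location.assign("...");
    $call = '#\blocation\.(?:replace|assign)\(\s*["\']([^"\']+)["\']\s*\)#i';
    if (preg_match($call, $html, $m)) {
        return trim($m[1]);
    }

    return null;
}
```

You would pass it the `$html` returned by `curl_exec($ch)`. A `null` result only means that none of these patterns matched; it does not prove the page has no client-side redirect, since scripts can build the target URL dynamically.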