
I used to scrape a website for information using the file_get_contents() function in PHP. Now, every time I try to scrape the page, it only returns

<html><head><meta http-equiv="Refresh" content="0; URL=http://website.com/latest.php?ckattempt=1"></head><body></body></html>

This was the code that I had used that used to work

$opts = array(
    'http'=>array(
        'method'=>"GET",
        'header'=>"Accept-language: en\r\n".
                  "Referer: ".$url."/index.php\r\n".
                  "Cookie: id=<id token>; auth=<auth token>;"
    )
);
$context = stream_context_create($opts);
$html = file_get_contents($url.'/latest.php?ckattempt=0', false, $context);

I assume this has something to do with the refresh meta tag, but does anyone know of a way I could get around it so I can scrape the page again?

Andrew Butler

1 Answer


If I interpret your question correctly, your problem stems from the fact that the page you usually loaded has changed on the target server. Instead of the old content, the page you are loading now uses a meta tag (called a meta refresh) to forward the client to another page (http://website.com/latest.php?ckattempt=1 in this particular example).

Read about meta refresh here

What you need to do (in order to get to the data you'd like to read) is probably to follow that link: load the URL provided in the meta tag and read the data from there.

cURL can follow redirects, but I am not entirely sure it will follow a meta tag, as this is a rather outdated method of forwarding, and I don't remember cURL spending any time parsing incoming HTML (it doesn't at all, actually).
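To illustrate the distinction: the redirects cURL can follow are HTTP-level ones (a 3xx status with a Location header), enabled via CURLOPT_FOLLOWLOCATION. A meta refresh lives in the HTML body, which cURL returns verbatim without parsing. A minimal sketch (the URL is taken from the question and used only as a placeholder):

```php
<?php
// cURL follows HTTP-level redirects (3xx + Location header) when
// CURLOPT_FOLLOWLOCATION is set -- it never parses HTML, so a
// <meta http-equiv="Refresh"> in the body comes back as plain text.
$ch = curl_init('http://website.com/latest.php?ckattempt=0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location-based redirects
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // safety cap on the redirect chain
$html = curl_exec($ch);
curl_close($ch);
```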

Use of meta refresh is discouraged by the World Wide Web Consortium (W3C)

Your best option in the given case is to parse the incoming data, pick out the desired information (the URL in the meta tag), and load that URL instead.

You could do this using regex. See this question about which regex to use to detect a link in a string.
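As a rough sketch, a regex like the following could pull the URL out of the meta tag. The pattern is an assumption tuned to the markup shown in the question, not a general HTML parser, and the function name is made up for illustration:

```php
<?php
// Extract the target URL from a meta refresh tag such as:
//   <meta http-equiv="Refresh" content="0; URL=http://website.com/latest.php?ckattempt=1">
// Returns the URL, or null if the page contains no meta refresh.
function extractMetaRefreshUrl($html) {
    $pattern = '/<meta\s+http-equiv=["\']?refresh["\']?\s+content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return trim($m[1]);
    }
    return null;
}
```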

Abstract steps:

  • Load page using your common file_get_contents() call
  • Parse the incoming page and see if it contains a meta tag with the http-equiv attribute set to refresh
  • If you find this tag, pass the contents you received to a function which extracts the target URL
  • Use file_get_contents() on that target URL to get the data you aim for
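The abstract steps above could be sketched roughly like this. Following only a single hop (rather than looping until no refresh tag remains) is a simplifying assumption, and the regex is tuned to the markup shown in the question:

```php
<?php
// Sketch of the abstract steps: load the page, and if it turns out to be
// a meta-refresh stub, follow the URL it points to (one hop only).
function fetchFollowingMetaRefresh($url, $context = null) {
    $html = file_get_contents($url, false, $context);

    // Look for a tag like <meta http-equiv="Refresh" content="0; URL=...">
    if (preg_match('/<meta\s+http-equiv=["\']?refresh["\']?\s+content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)/i', $html, $m)) {
        // Found a refresh stub: load the page it forwards to instead.
        $html = file_get_contents(trim($m[1]), false, $context);
    }
    return $html;
}
```

The same stream context used in the question (cookies, Referer) can be passed through as the second argument, so the follow-up request carries the same headers.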
SquareCat
  • I get what you're saying, but it's just odd, because the URL provided in the meta tag is the same URL that I am requesting in the first place? – Andrew Butler Dec 18 '13 at 21:31
  • Don't know much on the subject, but is it possible that the site is using the meta refresh tag to redirect people for the explicit purpose of avoiding scrapers? – Martin Sheeks Dec 18 '13 at 21:33
  • Then perhaps the `ckattempt` parameter is used to determine whether it's an 'attempt' at something. I can't be sure, but you can try playing around with that parameter and see what happens. You might also want to see what happens if you explicitly change the parameter to zero (0). – SquareCat Dec 18 '13 at 21:34
  • I'm pretty sure that is what the ckattempt variable is for, just trying to figure out a way around it lol – Andrew Butler Dec 18 '13 at 21:35
  • Check the headers that are sent back with the page that contains the meta tag. Perhaps some cookie is being set. If so, it is possible that without this cookie you will just keep increasing the `ckattempt` parameter on every load. The term `attempt` appears rather suspicious in this particular context. – SquareCat Dec 18 '13 at 21:36
  • cURL will not make you very happy here, because I don't think it's ever going to parse or actually react to the `meta` tag. cURL will react to proper forwards placed in a proper HTTP header (Location). See [This question](http://stackoverflow.com/questions/1820705/php-can-curl-follow-meta-redirects) for details. – SquareCat Dec 18 '13 at 21:48