
I am scraping websites using the FriendsOfPHP/Goutte package. Everything works great. I'm scraping the sites for open graph tags like image, title, etc., when a user pastes a URL into an input.

The problem occurs when a user copies the URL from a mobile device. The URL is then a mobile URL, like https://m.datpiff.com/tape/818948, and that page has no open graph tags.

When I access the same URL from a desktop and replace the subdomain m with www, e.g. https://www.datpiff.com/tape/818948, it redirects me to http://www.datpiff.com/Chance-The-Rapper-Jeremih-Merry-Christmas-Lil-Mama-mixtape.818948.html, and this desktop URL does contain open graph tags.

Is there a way I can get my server to force or trick the receiving server to redirect all URLs to the desktop version, so that I can use the open graph tags? The receiving server is already redirecting to the proper URL, but only if I'm typing directly from a browser on a desktop.

Here's the code I'm using - it works great. I just need to be able to redirect the URL I'm scraping to the desktop version.

First I'm replacing the m with www in my js like so:

fullurl.replace('m.',"www");

That converts https://m.datpiff.com/tape/818948 into https://www.datpiff.com/tape/818948

Then in my PHP code I'm using something like this:

use Goutte\Client;

$url_to_scrape = $urltoscrape;
$client = new Client();

// Fetch the page the user pasted
$crawler = $client->request('GET', $url_to_scrape);

// Read the og:image tag and the page title
$opengraphImage = $crawler->filterXpath('//meta[@property="og:image"]')->attr('content');
$title = $crawler->filter('title')->text();
Sᴀᴍ Onᴇᴌᴀ
Luna
  • `fullurl.replace('m.',"www");` seems like a bad call, in part because it's going to turn `https://m.datpiff.com/tape/818948` into `https://wwwdatpiff.com/tape/818948` and in part because it's going to turn `http://example.com/m.html` into `http://example.com/wwwhtml`. – ceejayoz Dec 25 '16 at 21:08
  • ceejayoz, my error: I'm replacing 'm' with 'www'. I've console-logged the URL and I get back what I need, which is https://www.datpiff.com/tape/818948. Any idea how I can get an answer to the original question? Thanks – Luna Dec 26 '16 at 15:16
  • ceejayoz, I understand now what you mean. What I'm doing now is fullurl.replace("://m.","://www.") – Luna Dec 27 '16 at 18:56
  • Besides the 'm.' replacement, you have to append '?m=0' to the URL. This way the site knows that it has to serve the desktop version. – Alex Dec 27 '16 at 23:13
  • Alex Giuvara, that sounded like it would work, but no, it doesn't do anything. – Luna Dec 28 '16 at 20:45
  • Have you tried to change PHP's user agent? For example, Chrome 54 on Win 10: `ini_set('user_agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36');` – Nikola Miljković Dec 29 '16 at 10:34
  • Where is class _Client_ defined? Is it from a framework or custom in your code library? – Sᴀᴍ Onᴇᴌᴀ Dec 30 '16 at 01:31

3 Answers

You need to send a cookie with the request so that the site redirects you to the desktop version:

name      value    domain              path
mredir    0        .www.datpiff.com    /

It's strange that replacing m. with www. doesn't work on its own. Try sending a desktop user agent too.
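
As a rough sketch of the suggestion above, the cookie and a desktop user agent can be attached to a request with PHP's built-in HTTP stream context, which avoids depending on any particular client's API. The helper name `desktopContext` is made up for illustration, the user-agent string is the one quoted in the comments on the question, and whether `mredir=0` actually triggers the desktop redirect on this site is an assumption taken from this answer, not something verified here.

```php
<?php
// Hypothetical helper: builds an HTTP stream context that sends a desktop
// User-Agent and the given cookies, and follows redirects.
function desktopContext(string $userAgent, array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode((string) $value);
    }

    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'header'          => 'Cookie: ' . implode('; ', $pairs) . "\r\n",
            'follow_location' => 1, // follow 3XX responses
            'max_redirects'   => 5,
        ],
    ]);
}

$context = desktopContext(
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    ['mredir' => '0'] // cookie name and value as suggested in this answer
);

// The context would then be passed as the third argument, for example:
// $html = file_get_contents('https://www.datpiff.com/tape/818948', false, $context);
```

The request itself is left commented out so the snippet can be run without network access.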

thejoin

Unless you need to use that Client class, you can use file_get_contents() along with DOMDocument (borrowing code from this answer) to get a SimpleXMLElement and call SimpleXMLElement::xpath() to access the open graph tags.

$url = 'https://www.datpiff.com/tape/818948';
$html = file_get_contents($url);
// preview the first 400 characters of the fetched HTML
print substr(htmlspecialchars($html), 0, 400).'<br />';
$doc = new DOMDocument();
//suppress errors when loading html
@$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$images = $xml->xpath('//meta[@property="og:image"]');
if (sizeof($images)) {
    $opengraphImage = (string)$images[0]['content'];
    echo 'opengraph image: '.$opengraphImage.'<br /><br />';
}
$titles = $xml->xpath('//title');
if (sizeof($titles)) {
    $title = (string)$titles[0];
    echo 'title: '.$title.'<br />';
}

See it demonstrated in this playground example.
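
If more than one open graph tag is needed, the same idea can be wrapped in a small function that collects every `og:*` meta tag into an array. This is only a sketch: `extractOpenGraph` and the sample HTML are made up for illustration, and it uses DOMXPath directly rather than going through SimpleXML.

```php
<?php
// Hypothetical helper: returns every og:* meta tag found in $html as
// ['og:title' => '...', ...]. A tag that is not on the page is simply
// absent from the array.
function extractOpenGraph(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $tags[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $tags;
}

// Made-up sample page standing in for the fetched HTML.
$sample = '<html><head><title>Demo</title>'
    . '<meta property="og:title" content="Demo tape">'
    . '<meta property="og:image" content="http://example.com/cover.jpg">'
    . '</head><body></body></html>';

$og = extractOpenGraph($sample);
$image = $og['og:image'] ?? null; // null when the page has no og:image
```

Falling back to null for a missing tag lets the caller decide what to do with pages such as the mobile one, which has no open graph tags at all.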

Sᴀᴍ Onᴇᴌᴀ

You can set your client to follow redirect responses (HTTP status 3XX + Location header). Add this line after instantiating $client:

$client->followRedirects(true);

The site doesn't redirect mobile links when they are requested from a desktop browser, so you still need to replace m. with www.
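
The replacement can also be done on the server so that only the host is touched, which avoids the problems pointed out in the comments on the question. A minimal sketch, assuming mobile URLs always start with `http://m.` or `https://m.`; the function name `toDesktopUrl` is made up for illustration.

```php
<?php
// Hypothetical helper: swaps a leading "m." host label for "www." and
// leaves every other part of the URL (path, query) untouched.
function toDesktopUrl(string $url): string
{
    // Anchored at the start, so an "m." later in the URL is never matched.
    return preg_replace('#^(https?://)m\.#i', '${1}www.', $url, 1) ?? $url;
}

$desktop = toDesktopUrl('https://m.datpiff.com/tape/818948');
$unchanged = toDesktopUrl('http://example.com/m.html');
```

The result can then be passed to `$client->request('GET', ...)` with redirects enabled as described above.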

shudder