Get background image from webpage using DOM XPATH

Question

I'm reading a webpage using PHP DOM/XPath and I've managed to get the text I need, but I can't get the src of the main image. To complicate things, the page source is different from what the inspector shows.

Here is the source:

<div id="bg">
            <img src="https://example.com/image.jpg" alt=""/>
</div>

And here is the element in the inspector:

<div class="media-player" id="media-player-0" style="width: 320px; height: 320px; background: url(&quot;https://example.com/image.jpg&quot;) center center / cover no-repeat rgb(208, 208, 208);" currentmouseover="16">

I've tried:

$img = $xpath->evaluate('substring-before(substring-after(//div[@id=\'bg\']/img, "\')")');

and

$img = $xpath->evaluate('substring-before(substring-after(//div[@class=\'media-player\']/@style, "background: url(\'"), "\')")');

but get nothing from either.

Here is my complete code:

$html = file_get_contents($externalurl);
$doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $allChildNodesFromDiv = $xpath->query('//h1[@class="artist"]');
    $releasetitle = $allChildNodesFromDiv->item(0)->textContent;
    echo "</br>Title: " . $releasetitle;

    $img = $xpath->evaluate('substring-before(substring-after(//div[@class=\'media-player\']/@style, "background: url(\'"), "\')")');    
    echo $image;

    $img = $xpath->evaluate('substring-before(substring-after(//div[@id=\'bg\']/img, "\')")');
    echo $image;

Here is the URL I'm scraping: https://lnk.to/Michael-Gray-Rework and this is what I'm trying to get: https://284fc2d5f6f33a52cd9f-ce476c3c56a27f320262daffab84f1af.ssl.cf3.rackcdn.com/artwork_5e74a44e1e004_CHAMPDL879D_5e74a44e4672b.jpg — TomC, Apr 04 '20 at 16:21
It looks like this data is loaded by JavaScript. If you save `$html` and then look through that source, `media-player` isn't set anywhere. — Nigel Ren, Apr 04 '20 at 16:31
Ah yes, it appears in: `poster : 'https://284fc2d5f6f33a52cd9f-ce476c3c56a27f320262daffab84f1af.ssl.cf3.rackcdn.com/artwork_5e74a44e1e004_CHAMPDL879D_5e74a44e4672b.jpg'`. Is there a way to grab that, or should I look at something like `stripos()`? — TomC, Apr 04 '20 at 16:48

score 2 · Accepted Answer · answered Apr 04 '20 at 16:55

Not something I would normally suggest, but as the particular content you are after is loaded from JavaScript, and that content sits inside <script> tags, it may be an easy one for a regex to extract. From your comment...

Ah yes, it appears in: poster : 'https://284fc2d5f6f33a52cd9f-ce476c3c56a27f320262daffab84f1af.ssl.cf3.rackcdn.com/artwork_5e74a44e1e004_CHAMPDL879D_5e74a44e4672b.jpg'

So this code looks for the value of poster : '...',

$html = file_get_contents($externalurl);

preg_match("/poster : '(.*)',/", $html, $matches);
echo $matches[1];

This can be prone to changes in the html, but it may work for now.
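
A slightly more defensive variant of the same idea is sketched below. It is not part of the original answer: the function name `extractPosterUrl` and the sample string are made up for illustration, and it assumes the page keeps emitting a `poster` key followed by a quoted URL. It tolerates different spacing around the colon, accepts either quote style, stops at the closing quote instead of matching greedily, and returns null when nothing is found rather than reading an undefined `$matches[1]`.

```php
<?php
// Hypothetical helper: returns the poster URL found in $html, or null if absent.
function extractPosterUrl(string $html): ?string
{
    // \s* tolerates spacing changes around the colon.
    // ([\'"]) remembers the opening quote, [^\'"]+ captures up to the next quote,
    // and \1 requires the same quote character to close the value.
    $pattern = '/poster\s*:\s*([\'"])([^\'"]+)\1/';

    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[2];
    }

    return null;
}

// Made-up sample standing in for the fetched page source.
$sample = "player.setup({ poster : 'https://example.com/image.jpg', autoplay : false });";
echo extractPosterUrl($sample) ?? 'poster not found';
```

With the real page you would pass the result of `file_get_contents($externalurl)` instead of `$sample`. It is still a regex over script text, so it can break whenever the site changes how it writes that value.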

Nigel Ren

Thanks - that works for me. Any reason why you wouldn't normally suggest it? – TomC Apr 04 '20 at 17:05

Normally, if you are processing HTML tags, regexes are the last thing you should use; you should use DOMDocument, as you were. A good post on this is https://stackoverflow.com/a/1732454/1213708 – Nigel Ren Apr 04 '20 at 17:10
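
For comparison, here is a minimal sketch of the DOMDocument route for the `<div id="bg">` markup quoted in the question. It is not from the original thread: the function name `extractBgImageSrc` is hypothetical, and it assumes the HTML returned by `file_get_contents()` really contains that `<img>` element. It selects the `src` attribute node and wraps it in XPath's `string()` function, so `evaluate()` returns a plain string.

```php
<?php
// Hypothetical helper: returns the src of the <img> inside <div id="bg">, or null.
function extractBgImageSrc(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // string() converts the matched attribute node to a string ('' if nothing matches).
    $src = $xpath->evaluate('string(//div[@id="bg"]/img/@src)');

    return $src !== '' ? $src : null;
}

// The markup from the question, used here as a stand-in for the fetched page.
$sample = '<div id="bg"><img src="https://example.com/image.jpg" alt=""/></div>';
echo extractBgImageSrc($sample) ?? 'image not found';
```

This only helps for elements present in the raw source. Anything the browser builds later with JavaScript, such as the `media-player` div seen in the inspector, will not be in the document that DOMDocument parses.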
