
I have PHP code that stores the HTML source of a page in a variable, and I want to extract just two links from it. The first link is in the content attribute of a meta tag:

<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>

The second is inside a JavaScript call:

jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);

I need only these two links; they change every time the page is reloaded:

http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg
http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4
miken32
buli
  • You should use parsers, not regexes, for this. For the HTML see: http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php – chris85 Oct 27 '16 at 20:38
  • A parser seems like overkill; I've never used one and it seems harder than regex. The script I'm writing is only for me, so I need a simple method to do this, just for the links I posted above. – buli Oct 27 '16 at 20:53
  • It is 2 lines with SimpleXML: `$string = new SimpleXMLElement('<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>');` then `echo $string['content'];`. Your regex is going to be making assumptions about whitespace usage, quote usage, attribute order, etc. – chris85 Oct 27 '16 at 20:58
  • Hmm ok yeah, but what I mean is, I'm getting the whole site source ($page = file_get_contents('http://example.com');), which is a lot of HTML code. Now I need to get the value of content from the meta property og:image, and the other one is JavaScript code, so I don't think a parser would be able to get this. https://d3higte790sj35.cloudfront.net/images/md/xs/05f811fc99efc01a5fe93566ed8a0ff3.jpeg is an example of this code. Also, the links change all the time. – buli Oct 27 '16 at 21:08
  • Yeah, since you are getting a full website a regex is more likely to fail. Use a parser and you can have it drill down to the element/attribute you want. – chris85 Oct 27 '16 at 21:10
  • Ok thanks I will work something out with the parser – buli Oct 27 '16 at 21:12
  • You will probably want to look at http://php.net/manual/en/domdocument.getelementsbytagname.php since it is HTML, not XML. Use loadHTML (http://php.net/manual/en/domdocument.loadhtml.php), not loadXML (as the example has). – chris85 Oct 27 '16 at 21:13
  • Ok, I've got the meta og:image part working. But I can't figure out how to get that link from the JavaScript code. – buli Oct 27 '16 at 21:37
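
For the og:image part, the parser approach suggested in the comments could look roughly like the following. This is only a minimal sketch: the helper name `findOgImage` and the sample `$page` string are made up for illustration, with `$page` standing in for whatever `file_get_contents()` returned.

```php
<?php
// Hypothetical helper (not from the thread): returns the og:image URL or null.
function findOgImage(string $html): ?string
{
    // Real-world HTML is rarely well-formed; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Walk every <meta> element and pick the one whose property is og:image.
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === 'og:image') {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Sample input standing in for the downloaded page source.
$page = '<html><head><meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/></head><body></body></html>';

echo findOgImage($page), PHP_EOL;
```

Because the lookup goes through the attribute name rather than its position in the tag, it should not matter whether `property` or `content` comes first.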

2 Answers


If you insist on regex, here's one for the first link: https://regex101.com/r/CHpfDY/1

And here's the second: https://regex101.com/r/VVF0Gf/1
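
In PHP, a regex-only approach might be wired up roughly as below. The two patterns are illustrative assumptions written against the snippets in the question, not copies of the ones behind the regex101 links, and they assume that `property` precedes `content` and that the quoting matches the sample.

```php
<?php
// Sample input built from the two snippets in the question.
$page = <<<'HTML'
<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>
<script>
jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);
</script>
HTML;

// Illustrative patterns (assumptions, see the note above).
$imagePattern = '~<meta\s+property="og:image"\s+content="([^"]+)"~i';
$videoPattern = '~jw\.load\(\[\{\s*file:\s*\'([^\']+)\'~';

// preg_match() returns 1 on a match; group 1 holds the URL.
$imageUrl = preg_match($imagePattern, $page, $m) ? $m[1] : null;
$videoUrl = preg_match($videoPattern, $page, $m) ? $m[1] : null;

echo $imageUrl, PHP_EOL;
echo $videoUrl, PHP_EOL;
```

If the site ever changes attribute order or switches quote styles, both variables simply end up `null`, which is the fragility the comments warn about.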

Thomas Landauer

Unless you have a JavaScript parser for PHP handy, you will still need a regular expression for the script part, but you can at least get rid of it for the HTML search. Something like this should work, though it's hard to test without the real URL:

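<?php
// Keep libxml from emitting warnings on real-world, non-well-formed HTML.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTMLFile("http://example.com/example.html");
$xpath = new DOMXPath($dom);

// Initialise both results so the echo calls are safe when nothing is found.
$url1 = null;
$url2 = null;

// First link: the content attribute of the og:image meta tag.
$metanode = $xpath->query("//meta[@property='og:image']/@content");
if ($metanode->length) {
    $url1 = $metanode->item(0)->value;
}

// Second link: the file argument of jw.load(), searched line by line in each script block.
$scriptnode = $xpath->query("//script");
foreach ($scriptnode as $script) {
    $array = explode("\n", $script->nodeValue);
    foreach ($array as $line) {
        if (preg_match("/jw\.load\(\[\{\s*file:\s*'(.*?)'/", $line, $matches)) {
            $url2 = $matches[1];
            break 2; // leave both loops once the link is found
        }
    }
}

echo $url1, PHP_EOL;
echo $url2, PHP_EOL;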
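
Since the question already holds the page source in a variable, the same idea can be tried without a live URL by calling `loadHTML()` on a string. The following is a rough sketch under that assumption; `extractLinks` and the sample markup are hypothetical, assembled from the two snippets in the question.

```php
<?php
// Hypothetical wrapper around the approach above; takes the HTML as a string.
function extractLinks(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = ['image' => null, 'video' => null];

    // og:image: select the content attribute directly with XPath.
    $content = $xpath->query("//meta[@property='og:image']/@content");
    if ($content->length) {
        $links['image'] = $content->item(0)->value;
    }

    // jw.load(): the parser only isolates the script text; a regex does the rest.
    foreach ($xpath->query('//script') as $script) {
        if (preg_match('~jw\.load\(\[\{\s*file:\s*\'([^\']+)\'~', $script->nodeValue, $m)) {
            $links['video'] = $m[1];
            break;
        }
    }
    return $links;
}

// Sample page standing in for the result of file_get_contents().
$page = <<<'HTML'
<html><head>
<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>
<script>
jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);
</script>
</head><body></body></html>
HTML;

print_r(extractLinks($page));
```
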
miken32