Use a regex to get text from html source code

Question

I have PHP code that stores the HTML source of a site in a variable, and I want to extract just two links from that source. The first link is in the content attribute of a meta tag:

<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>

The second is inside a JavaScript call:

jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);

I need only these two links, which change every time the page is reloaded:

http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg
http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4

You should use parsers, not regexes, for this. For the HTML see: http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php — chris85, Oct 27 '16 at 20:38
Parser seems like overkill, never used it and seems harder than regex. The script I'm writing will be only for me so I need a simple method to do this. Only those links that I've posted above. — buli, Oct 27 '16 at 20:53
It is 2 lines with SimpleXML: `$string = new SimpleXMLElement('<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>');` then `echo $string['content'];`. Your regex is going to be making assumptions about whitespace usage, quote usage, attribute order, etc. — chris85, Oct 27 '16 at 20:58
Hmm ok yeah, but what I mean is, I'm getting the whole site source ($page = file_get_contents('http://example.com');), which is a lot of HTML code. Now I need to get the value of content from the meta property og:image. The other link is in JavaScript code, so I don't think a parser would be able to get it. https://d3higte790sj35.cloudfront.net/images/md/xs/05f811fc99efc01a5fe93566ed8a0ff3.jpeg is an example of this code. Also, the links change all the time — buli, Oct 27 '16 at 21:08
Yea, since you are getting a full website a regex is more likely to fail. Use a parser and you can have it drill down to the element/attribute you want. — chris85, Oct 27 '16 at 21:10
You probably will want to look at http://php.net/manual/en/domdocument.getelementsbytagname.php since it is HTML, not XML. Use the http://php.net/manual/en/domdocument.loadhtml.php, not loadXML (as the example has). — chris85, Oct 27 '16 at 21:13
Ok I've got the meta og:image thing working. But I can't figure how to get that link from javascript code — buli, Oct 27 '16 at 21:37
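
For readers following the comments, here is a minimal sketch of the approach suggested there (DOMDocument::loadHTML plus getElementsByTagName) for the og:image link only. It assumes the page source is already in a string, as with the file_get_contents call mentioned above. The function name findOgImage and the sample markup are hypothetical, not taken from the asker's script.

```php
<?php
// Hypothetical helper: returns the og:image URL found in $html, or null.
function findOgImage(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // Walk every <meta> element and pick the one whose property is og:image.
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === 'og:image') {
            return $meta->getAttribute('content');
        }
    }

    return null;
}

// Sample markup modelled on the tag shown in the question.
$page = '<html><head><meta property="og:image" '
      . 'content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>'
      . '</head><body></body></html>';

echo findOgImage($page), "\n";
```

Because the parser reads attributes by name, this does not depend on attribute order, quoting style, or whitespace inside the tag.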

Thomas Landauer · Answer 1 · 2016-10-27T21:41:45.333

0

If you insist on regex, here's one for the first link: https://regex101.com/r/CHpfDY/1

And here's the second: https://regex101.com/r/VVF0Gf/1
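
The patterns behind those links are not reproduced here. As an independent sketch of what a regex-only route might look like in PHP, the following uses preg_match and assumes exactly the attribute order, quoting, and spacing shown in the question. The function name extractLinksWithRegex and both patterns are illustrative assumptions, not the patterns from the links above.

```php
<?php
// Hypothetical helper: returns [imageUrl, videoUrl]; either entry may be null.
function extractLinksWithRegex(string $page): array
{
    $image = null;
    $video = null;

    // Assumes property comes before content and both attributes use double quotes.
    if (preg_match('~<meta\s+property="og:image"\s+content="([^"]+)"~i', $page, $m)) {
        $image = $m[1];
    }

    // Assumes the URL is single-quoted and follows the first "file:" key after jw.load([{
    if (preg_match('~jw\.load\(\s*\[\s*\{\s*file:\s*\'([^\']+)\'~', $page, $m)) {
        $video = $m[1];
    }

    return [$image, $video];
}

// Sample input built from the two snippets in the question.
$page = '<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>' . "\n"
      . "jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);";

[$image, $video] = extractLinksWithRegex($page);
echo $image, "\n", $video, "\n";
```

If the site ever swaps the attribute order or switches quote style, the first pattern stops matching, which is the fragility raised in the comments.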

edited Oct 27 '16 at 21:41

answered Oct 27 '16 at 21:29


Sure, I'm interested. – buli Oct 27 '16 at 21:35

score 0 · Answer 2 · answered Apr 04 '17 at 04:04

Unless you have a PHP JavaScript parser handy, you can at least get rid of the regular expression for the HTML search. Something like this should work, though it's hard to test without the URL...

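<?php
// Collect warnings from malformed real-world HTML instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTMLFile("http://example.com/example.html");
$xpath = new DOMXPath($dom);

// Initialise both results so the echo calls below never read an undefined variable.
$url1 = null;
$url2 = null;

// First link: the content attribute of the og:image meta tag.
$metanode = $xpath->query("//meta[@property='og:image']/@content");
if ($metanode->length) {
    $url1 = $metanode->item(0)->value;
}

// Second link: scan each script block line by line for the jw.load() call.
$scriptnode = $xpath->query("//script");
foreach ($scriptnode as $script) {
    $array = explode("\n", $script->nodeValue);
    foreach ($array as $line) {
        // Escape the metacharacters in "jw.load([{" so they match literally.
        if (preg_match("/jw\\.load\\(\\[\\{\\s*file:\\s*'(.*?)'/", $line, $matches)) {
            $url2 = $matches[1];
            break 2; // leave both loops
        }
    }
}

echo $url1, "\n";
echo $url2, "\n";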
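
Since the URL was not available for testing, here is a variant of the script-scanning step that can be run offline against a string. It is only a sketch: the function name findJwFile and the sample markup are hypothetical, and it swaps in an XPath contains() filter plus a single regex over each matching script's text in place of the line-by-line split.

```php
<?php
// Hypothetical helper: returns the first jw.load() file URL in $html, or null.
function findJwFile(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);

    // Only visit <script> elements whose text mentions jw.load.
    foreach ($xpath->query("//script[contains(., 'jw.load')]") as $script) {
        // The s modifier lets .*? cross newlines if the call spans several lines.
        if (preg_match("~jw\\.load\\(.*?file:\\s*'([^']+)'~s", $script->textContent, $m)) {
            return $m[1];
        }
    }

    return null;
}

// Sample markup modelled on the call shown in the question.
$page = "<html><body><script>\n"
      . "jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);\n"
      . "</script></body></html>";

echo findJwFile($page), "\n";
```

Restricting the regex to script text means a stray "jw.load" in ordinary page content cannot produce a false match.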
