I want to extract data from a web page, but I am getting an error in preg_match:

    <?php

$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];

echo $title;
?>

This is the error I get:

Parse error: syntax error, unexpected 'instapp' (T_STRING) in /home/ubuntu/workspace/test.php on line 4

Please help: how can I do this? I also want to extract more data from the page with regex. Is it possible to extract everything at once with a single piece of code, or do I need to call preg_match several times?

1 Answer


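The main problem is that you did not form a valid string literal: the unescaped double quotes inside the pattern end the double-quoted string early, which is why PHP reports an unexpected 'instapp'. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage: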

preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);

While it is OK to use paired (...) symbols as regex delimiters, I'd suggest using a more conventional delimiter such as /, ~ or @.

Also, (.*) is too generic a pattern: it may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches 0+ chars other than ".
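To make the difference between the two patterns concrete, here is a minimal sketch. The markup in $html is a made-up sample (two meta tags on one line, with a hypothetical owner id of 12345), not the real Instagram page.

```php
<?php
// Hypothetical sample markup for illustration only; the id 12345 is made up.
$html = '<meta property="instapp:owner_user_id" content="12345" />'
      . '<meta property="og:title" content="Example" />';

// Greedy (.*) keeps going and backtracks to the LAST double quote on the line,
// so the capture swallows the start of the next tag as well.
preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);

// ([^"]*) cannot cross a double quote, so it stops at the first closing quote.
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $strict);

echo $greedy[1], "\n"; // 12345" /><meta property="og:title" content="Example
echo $strict[1], "\n"; // 12345
```

Both patterns sit inside single-quoted PHP strings, so the double quotes in the regex need no escaping.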

HOWEVER, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.

Here is a sample that gets all meta tags that have a content attribute, extracts the value of that attribute, and saves the values in an array:

$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
// The flags keep libxml from adding an implied <html>/<body> wrapper and a default doctype
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// Select every <meta> element that has a content attribute
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach ($metas as $m) {
    // Collect the value of the content attribute
    array_push($res, $m->getAttribute('content'));
}
print_r($res);

See the PHP demo
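As for extracting several values at once: one possible approach, sketched here under assumptions, is to walk all meta tags in a single pass and key each content value by its property name, so that no separate preg_match call is needed per field. The sample markup and the owner id 12345 are hypothetical, and $meta is just an illustrative variable name.

```php
<?php
// Hypothetical sample markup; the real page has many more meta tags,
// and the owner id 12345 is made up for illustration.
$html = '<html><head>'
      . '<meta property="al:ios:url" content="instagram://media?id=1329656989202933577" />'
      . '<meta property="instapp:owner_user_id" content="12345" />'
      . '</head><body></body></html>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$meta = [];
foreach ($xpath->query('//meta[@property and @content]') as $node) {
    // Key each content value by the tag's property name.
    $meta[$node->getAttribute('property')] = $node->getAttribute('content');
}

print_r($meta);
```

Individual values can then be read by name, for example $meta['instapp:owner_user_id'].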

And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use

$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
// Make sure the meta tag exists before reading its attribute
if ($metas->length > 0 && preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
    $id = $match[1];
}

See another PHP demo

Wiktor Stribiżew
  • Thanks, it worked. Actually, I am a total noob: I just started learning PHP and wanted to do this because I needed it for work. I want to extract more data from the same page. Can you please help me extract these two values from the page https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty ? – Professor Zero Nov 25 '16 at 11:23
  • See https://ideone.com/LnhF8M. Actually, I do not understand what exactly you need, but that is no longer related to this question. – Wiktor Stribiżew Nov 25 '16 at 11:36
  • The DOM-based code is too hard for me to understand. Can you just make a simple one like the one above, i.e. preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match); ? Just a single line like this that will extract 1329656989202933577 from that page [the value is dynamic]: view-source:https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty – Professor Zero Nov 25 '16 at 11:43
  • Well, try `if (preg_match('~"instapp:owner_user_id" content="[^\d"]*(\d+)"~', $html, $match)) { $my_required_val = $match[1]; }` – Wiktor Stribiżew Nov 25 '16 at 11:46
  • You are still not getting it: I just need to extract 1329656989202933577. Please make a regex that will extract 1329656989202933577 from the above code. – Professor Zero Nov 25 '16 at 11:49
  • Replace `instapp:owner_user_id` with `al:ios:url` in the above regex. See https://regex101.com/r/HXKYyJ/1 – Wiktor Stribiżew Nov 25 '16 at 11:50
  • Thank you so much. It worked. My work is finally done :D Thanks, bro. – Professor Zero Nov 25 '16 at 11:55
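
For reference, the regex suggested in the comments above, with al:ios:url substituted for instapp:owner_user_id, can be sketched as follows. The $html value is a single meta tag copied from the answer's sample string; the layout of the real page is an assumption and may differ.

```php
<?php
// One meta tag taken from the answer's sample string; the live page may differ.
$html = '<meta property="al:ios:url" content="instagram://media?id=1329656989202933577" />';

$id = null;
// [^\d"]* skips the non-digit prefix (instagram://media?id=),
// then (\d+) captures the run of digits before the closing quote.
if (preg_match('~"al:ios:url" content="[^\d"]*(\d+)"~', $html, $match)) {
    $id = $match[1];
}

echo $id, "\n"; // 1329656989202933577
```

The captured id is a string; keeping it as a string avoids precision problems with large numbers on platforms where PHP integers are 32-bit.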