I need to extract a meta property from crawled HTML source code. After crawling, the HTML contains the following:

Example:

<meta property="og:site_name" content="asasasas">
<meta property="og:title" content="asajhskajhsaksp;" /> 
<meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />

Here I want only the content of the meta element where property="og:image", i.e. the result should be only:

images.cxs.com/2014/09/modit1.gif?w=209

RobG
Kiran
  • [Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) – Biffen Oct 07 '14 at 06:22
  • @Biffen: What's wrong with using regex for this kind of task? There's no recursion or anything regex can't deal with. – Aran-Fey Oct 07 '14 at 06:49
  • @Rawing: HTML is not a regular language, so it can't be reliably parsed with a regular expression, though you might use regular expressions to tokenise input in an HTML parser. – RobG Oct 07 '14 at 06:52
  • @Rawing Did you read the answer at the link? What if `property` and `content` are in the reverse order? What if there's some other attribute in there? What if there's a commented-out `meta` element somewhere? What if there's an HTML element in an attribute? I could go on... – Biffen Oct 07 '14 at 06:56
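
To illustrate the attribute-order objection raised in the comments, here is a minimal sketch. The pattern `naive` and the two sample strings are hypothetical, made up for illustration; the pattern assumes `property` always comes directly before `content`, which is exactly the assumption the comments warn about.

```javascript
// Naive pattern (hypothetical): assumes `property` comes immediately
// before `content`, with nothing else in between.
var naive = /<meta property="og:image" content="([^"]*)"/;

// Same element written with the two attribute orders
var expected = '<meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />';
var reversed = '<meta content="images.cxs.com/2014/09/modit1.gif?w=209" property="og:image" />';

var hitExpected = naive.exec(expected);
var hitReversed = naive.exec(reversed);

// The first lookup finds the URL; the second finds nothing,
// although both strings describe the same element.
console.log(hitExpected ? hitExpected[1] : null);
console.log(hitReversed ? hitReversed[1] : null);
```

A DOM-based lookup does not depend on attribute order, which is why the answers below prefer it.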

3 Answers

With jQuery this is a one-liner, assuming the meta elements are part of the current document:

$('meta[property="og:image"]').attr('content')
aelor

As @Biffen said, don't use regex to parse HTML.

If you have the markup as a string in a variable, you can parse it into a detached element and use querySelector(), like this:

var html = '<meta property="og:site_name" content="asasasas" /><meta property="og:title" content="asajhskajhsaksp;" /><meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />';
// Parse the string inside a detached element so nothing is added to the page
var el = document.createElement('div');
el.innerHTML = html;
var meta = el.querySelector('meta[property="og:image"]');
// querySelector returns null when no element matches
if (meta) {
  console.log(meta.content);
  document.getElementById('result').innerHTML = meta.content;
}
<div id="result"></div>

If the meta element is part of the current page, query the document directly:

var meta = document.querySelector('meta[property="og:image"]');
// querySelector returns null when the page has no matching element
if (meta) {
  console.log(meta.content);
  document.getElementById('result').innerHTML = meta.content;
}
<meta property="og:site_name" content="asasasas"/>
<meta property="og:title" content="asajhskajhsaksp;" /> 
<meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />

<div id="result"></div>
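
Both variants above can be folded into one small function. This is only a sketch: `getMetaContent` is a hypothetical name, not part of any library, and `root` is assumed to be anything that exposes querySelector (the `document`, or a detached element filled via innerHTML).

```javascript
// Hypothetical helper (not from the answer above). `root` is assumed to be
// any object exposing querySelector, such as `document` or a detached element.
function getMetaContent(root, property) {
  var meta = root.querySelector('meta[property="' + property + '"]');
  // querySelector returns null when nothing matches, so guard before reading
  return meta ? meta.getAttribute('content') : null;
}

// Only touch `document` when running in a browser
if (typeof document !== 'undefined') {
  console.log(getMetaContent(document, 'og:image'));
}
```

Returning null instead of throwing lets the caller decide what to do when a page has no og:image element.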
Arun P Johny
  • Hi @Arun, I am using cURL to crawl the website first and store it in a file: `$ch = curl_init ($url); $fp = fopen ($file, "w") or die("Unable to open ".$file." for writing.\n"); curl_setopt ($ch, CURLOPT_FILE, $fp); curl_close ($ch); fclose ($fp);` Now I have the HTML code in that file, so can I proceed as you suggested above? Or is there another way to get the contents of a website besides cURL? cURL fetches the whole page, but I only want the HEAD section of the HTML. – Kiran Oct 07 '14 at 07:20

You can use the approach suggested by Arun; however, some user agents may not support the Selectors API or the features it requires (e.g. IE8). In that case, you can use getElementsByTagName and a plain old for loop.

// Collect every meta element, then filter on the property attribute
var node, nodes = document.getElementsByTagName('meta');
for (var i = 0, iLen = nodes.length; i < iLen; i++) {
  node = nodes[i];

  if (node.getAttribute('property') === 'og:image') {
    // do something with content
    console.log(node.content);
  }
}

The above will work in any browser in use and doesn't require any external library.
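
If the value is needed in more than one place, the loop can be wrapped in a function. The sketch below makes two assumptions: `findMetaContent` is a hypothetical name, and `nodes` is any array-like collection whose items expose getAttribute (for example the result of getElementsByTagName). Unlike the loop above, it stops at the first match.

```javascript
// Hypothetical helper: `nodes` is assumed to be an array-like collection
// of elements, e.g. document.getElementsByTagName('meta').
function findMetaContent(nodes, property) {
  for (var i = 0, iLen = nodes.length; i < iLen; i++) {
    if (nodes[i].getAttribute('property') === property) {
      // Return the content of the first matching element
      return nodes[i].getAttribute('content');
    }
  }
  // No element carried the requested property
  return null;
}

// Only touch `document` when running in a browser
if (typeof document !== 'undefined') {
  console.log(findMetaContent(document.getElementsByTagName('meta'), 'og:image'));
}
```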

RobG