Regular Expression to select part in HTML

Question

I have a requirement to extract a meta property from crawled HTML source code. After crawling, the HTML contains the following:

Example:

<meta property="og:site_name" content="asasasas">
<meta property="og:title" content="asajhskajhsaksp;" /> 
<meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />

Here I want to get the content only where meta property="og:image", i.e. the result should be only

images.cxs.com/2014/09/modit1.gif?w=209

[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) — Biffen, Oct 07 '14 at 06:22
@Biffen: What's wrong with using regex for this kind of task? There's no recursion or anything regex can't deal with. — Aran-Fey, Oct 07 '14 at 06:49
@Rawing—HTML is not a regular language, it can't be reliably parsed with a regular expression, though you might use regular expressions to tokenise input in an HTML parser. — RobG, Oct 07 '14 at 06:52
@Rawing Did you read the answer of the link? What if `property` and `content` are in the reverse order? What if there's some other attribute in there? What if there's a commented-out `meta` element somewhere? What if there's a HTML element in an attribute? I could go on... — Biffen, Oct 07 '14 at 06:56
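
To make the objections in the comments above concrete, here is a minimal sketch of a naive regex and two inputs it mishandles. The pattern, the `extractNaive` helper, and the `old.gif` value are made up for illustration and are not taken from the question; the sketch assumes double-quoted attributes with `property` written before `content`.

```javascript
// Naive pattern (hypothetical): expects property first, content second,
// double quotes, and no other attributes in between.
const naive = /<meta\s+property="og:image"\s+content="([^"]*)"/i;

// Returns the captured content value, or null when the pattern does not match.
function extractNaive(html) {
  const match = naive.exec(html);
  return match ? match[1] : null;
}

// Markup as posted in the question.
const asPosted = '<meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />';
// Same element with the attributes in the reverse order.
const reversed = '<meta content="images.cxs.com/2014/09/modit1.gif?w=209" property="og:image" />';
// A commented-out element that a real parser would ignore.
const commentedOut = '<!-- <meta property="og:image" content="old.gif" /> -->';

console.log(extractNaive(asPosted));     // images.cxs.com/2014/09/modit1.gif?w=209
console.log(extractNaive(reversed));     // null (valid markup, but missed)
console.log(extractNaive(commentedOut)); // old.gif (matched inside a comment)
```

The pattern can be extended to cover each case, but every extension is another special case that a DOM parser already handles.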

score 3 · Answer 1 · answered Oct 07 '14 at 06:23

Was it so difficult to use jQuery?

$('meta[property="og:image"]').attr('content')

aelor

There is no jQuery tag or mention of it in the OP. – RobG Oct 07 '14 at 06:50
There was a mention of JavaScript, so I thought that a jQuery solution might also suffice – aelor Oct 07 '14 at 07:00

score 1 · Accepted Answer · answered Oct 07 '14 at 06:33

As @Biffen said, don't use regex to parse HTML.

If you have that string in a variable, you can use querySelector() like this:

var html = '<meta property="og:site_name" content="asasasas" /><meta property="og:title" content="asajhskajhsaksp;" /><meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />';
var el = document.createElement('div');
el.innerHTML = html;
var meta = el.querySelector('meta[property="og:image"]');
console.log(meta.content);

document.getElementById('result').innerHTML = meta.content;

<div id="result"></div>

If it is part of the current page, then:

var meta = document.querySelector('meta[property="og:image"]');
console.log(meta.content);

document.getElementById('result').innerHTML = meta.content;

<meta property="og:site_name" content="asasasas"/>
<meta property="og:title" content="asajhskajhsaksp;" /> 
<meta property="og:image" content="images.cxs.com/2014/09/modit1.gif?w=209" />

<div id="result"></div>
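
Both snippets above assume the element exists; if no matching `meta` element is present, `querySelector` returns `null` and reading `meta.content` throws. A minimal null-safe sketch follows. `getMetaContent` is a hypothetical helper name, `root` is assumed to be anything that exposes `querySelector` (the `document` or a detached element such as `el` above), and `property` is assumed not to contain double quotes.

```javascript
// Hypothetical helper: returns the content attribute of the first
// <meta property="..."> under `root`, or null when there is none.
function getMetaContent(root, property) {
  // Assumes `property` contains no double quotes, since it is
  // concatenated straight into the selector string.
  const meta = root.querySelector('meta[property="' + property + '"]');
  return meta ? meta.getAttribute('content') : null;
}

// Browser usage; guarded so the definition can also be loaded outside a browser.
if (typeof document !== 'undefined') {
  console.log(getMetaContent(document, 'og:image'));
}
```

Returning `null` leaves it to the caller to decide what a missing tag means, instead of failing with a TypeError.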

Hi @Arun, I am using cURL to crawl the website first and store it in a file: `$ch = curl_init ($url); $fp = fopen ($file, "w") or die("Unable to open ".$file." for writing.\n"); curl_setopt ($ch, CURLOPT_FILE, $fp); curl_close ($ch); fclose ($fp);` Now I have the HTML code in that file. So can I go ahead as you have suggested above? Or is there any other way to get the contents of a website other than cURL? cURL fetches the whole page, but I want only the HEAD section of the HTML. — Kiran, Oct 07 '14 at 07:20

score 0 · Answer 3 · answered Oct 07 '14 at 07:00

You can use the approach suggested by Arun; however, there may be user agents that don't support the Selectors API or don't support the required features (e.g. IE8). In that case, you can use getElementsByTagName and a plain old for loop.

var node, nodes = document.getElementsByTagName('meta');
for (var i=0, iLen=nodes.length; i<iLen; i++) {
  node = nodes[i];

  if (node.getAttribute('property') == 'og:image') {

    // do something with content
    console.log(node.content);
  } 
}

The above will work in any browser in use and doesn't require any external library.
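
If the lookup is needed in more than one place, the loop can be wrapped in a small function. This is only a sketch: `findMetaContent` is a hypothetical name, `doc` is assumed to expose `getElementsByTagName`, and the code sticks to `var` and `getAttribute` so that it stays usable in the older browsers mentioned above.

```javascript
// Hypothetical helper: scans all <meta> elements in `doc` and returns the
// content attribute of the first one whose property attribute equals
// `property`, or null when no such element exists.
function findMetaContent(doc, property) {
  var nodes = doc.getElementsByTagName('meta');
  for (var i = 0, iLen = nodes.length; i < iLen; i++) {
    if (nodes[i].getAttribute('property') === property) {
      return nodes[i].getAttribute('content');
    }
  }
  return null; // no matching element found
}

// Browser usage; guarded so the definition can also be loaded outside a browser.
if (typeof document !== 'undefined') {
  console.log(findMetaContent(document, 'og:image'));
}
```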
