
I have a website where I feed information to an analytics engine via the meta tag as such:

<meta property="analytics-track" content="Hey&nbsp;There!">

I am trying to write plain JavaScript (no libraries) to read the content attribute and retrieve its value as is. In essence, the result should include the HTML entity, not a transformed or stripped version of it.

The reason is that I am using PhantomJS to find which pages have HTML entities in their metadata so they can be removed, as they screw up my analytics data. (For example, I'll have entries that include both Hey There! and Hey&nbsp;There! when in fact they are both the same page, and thus should not be two separate data points.)

The simplest JS I have is this:

document.getElementsByTagName('meta')[4].getAttribute("content")

When I run it in the console, it returns the text in the following format:

"Hey There!"

What I would like it to return is:

"Hey&nbsp;There!"

How can I ensure that the returned data keeps the HTML entity? If that's not possible, is there a way to detect an HTML entity via JavaScript? I tried:

document.getElementsByTagName('meta')[4].getAttribute("content").includes('&nbsp;')

But it returns false.

Felix Kling
Adib
  • I have to ask: why? If you want the literal value, html encode it. If you're not the one creating the HTML, the creator probably meant it to be encoded like this. But I know nothing. So why? – Rudie Nov 16 '15 at 23:11
  • @Rudie The creator shouldn't be encoding it like this since it'll break our analytics data set. At the moment, we have multiple instances of the same page due to the HTML entities. Even if the creator did it, it still makes 0 sense when it comes to feed it to analytics data. We also have a case where the HTML creator included the trademark entity. – Adib Nov 16 '15 at 23:15

3 Answers


Use querySelector to select the element whose property attribute is "analytics-track", outerHTML to get the element as a string, and match with a regex to extract the unparsed value of the content attribute.

document.querySelector('[property=analytics-track]').outerHTML.match(/content="(.*)"/)[1];

See http://jsfiddle.net/sjmcpherso/mz63fnjg/
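A slightly more defensive variant, offered only as a sketch: the helper name extractRawContent is made up for illustration, and it assumes the value is wrapped in double quotes, which is how outerHTML serializes attributes. The character class stops at the first closing quote, which avoids the greedy-match problem raised in the comments on this answer. Keep in mind that outerHTML is a re-serialization of the parsed DOM rather than the original source, so a literal U+00A0 character or &#160; in the markup should also come back as &nbsp;.

```javascript
// Hypothetical helper: pull the still-escaped content value out of a tag's markup.
// Assumes the value is wrapped in double quotes, as outerHTML serializes it.
function extractRawContent(tagMarkup) {
  // [^"]* stops at the first closing quote, so later attributes are not swallowed.
  var match = /\scontent="([^"]*)"/.exec(tagMarkup);
  return match ? match[1] : null;
}

// In a browser or PhantomJS page context:
// var el = document.querySelector('[property="analytics-track"]');
// var raw = el ? extractRawContent(el.outerHTML) : null;

var sample = '<meta property="analytics-track" content="Hey&nbsp;There!" foo="bar">';
console.log(extractRawContent(sample)); // Hey&nbsp;There!
```

The function returns null when there is no content attribute, so callers can flag pages without throwing.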

sjm
  • This was a brilliant solution! Never going to underestimate the power of regex – Adib Nov 16 '15 at 23:04
  • This solution is quite good because even the `includes` method returns true when I do: `document.querySelector('[property=analytics-track]').outerHTML.match(/content="(.*)"/)[1].includes('&nbsp;');` – Adib Nov 16 '15 at 23:05
  • `"(.*)"` is greedy and will match `"` too, so if another attribute follows, it will match up to the end of that one: `content="bla" foo="bar"` will result in `bla" foo="bar`, not just `bla`. – Rudie Nov 16 '15 at 23:13
  • @Rudie Thanks for that note, I modified the regex to be `match(/content="(.*)?"/)` – Adib Nov 16 '15 at 23:19
  • [Something something regex HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) –  Nov 16 '15 at 23:23

You can't; that &nbsp; isn't really there. It's just an encoding for a non-breaking space. To the document, the DOM, the web page, to everything, it looks like:

Hey There!

Except the character between the y and the T isn't the kind of space you'd get by hitting the space bar; it's a completely different character (U+00A0).

Observe:

<span id='a' data-a='Hey&nbsp;There!'></span>
<span id='a1' data-a='Hey&nbsp;There!'></span>
<span id='b' data-b='Hey There!'></span>

var a = document.getElementById('a').getAttribute('data-a')
var a1 = document.getElementById('a1').getAttribute('data-a')
var b = document.getElementById('b').getAttribute('data-b')
console.log(a,b,a==b)
console.log(a,a1,a==a1)

Gives:

Hey There! Hey There! false
Hey There! Hey There! true
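Because the parsed value holds the actual character U+00A0, a check against that code point works where a search for the literal text &nbsp; fails. A minimal sketch for flagging such values (the function name hasNonBreakingSpace is illustrative only; indexOf is used on the assumption that older engines may lack String.prototype.includes):

```javascript
// Illustrative helper: true if the parsed string contains U+00A0 (non-breaking space).
function hasNonBreakingSpace(value) {
  return value.indexOf('\u00A0') !== -1;
}

console.log(hasNonBreakingSpace('Hey\u00A0There!')); // true
console.log(hasNonBreakingSpace('Hey There!'));      // false
console.log('Hey\u00A0There!'.charCodeAt(3));        // 160 (a regular space is 32)
```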

Instead, consider altering your method of 'equality' to view a space and a non-breaking space as equal:

// Match a literal non-breaking space (U+00A0) or the raw entity text, globally.
var re = /\u00A0|&nbsp;/g;
x = x.replace(re, ' ');
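The idea of comparing values only after normalization could be sketched like this. The names normalizeSpaces and sameTitle are hypothetical, and the sketch handles only non-breaking spaces; other entities, such as the trademark sign, would need their own rules.

```javascript
// Illustrative helper: treat U+00A0 and the raw text "&nbsp;" as a plain space.
function normalizeSpaces(value) {
  return value.replace(/\u00A0|&nbsp;/g, ' ');
}

// Two strings count as "equal" if they match after normalization.
function sameTitle(a, b) {
  return normalizeSpaces(a) === normalizeSpaces(b);
}

console.log(sameTitle('Hey\u00A0There!', 'Hey There!')); // true
console.log(sameTitle('Hey&nbsp;There!', 'Hey There!')); // true
console.log(sameTitle('Hey There!', 'Hi There!'));       // false
```

Nothing on the page is modified here; only the comparison changes.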
Community
  • +1 I still think this is a much better solution as it won't break if the meta tags are formed/structured differently (while still being valid HTML). – JCOC611 Nov 16 '15 at 23:06
  • @Ultimater fixed. Thanks –  Nov 16 '15 at 23:22
  • I should've stated that I cannot replace/edit the page since my goal is primarily raising flags (big company politics doesn't allow me to fix/edit the page yada yada yada) – Adib Nov 16 '15 at 23:27
  • You don't need to *replace* the non-breaking spaces, just change the code that considers what is equal. Equality is a tricky thing. I can consider `o`, `O` and `0` to all be equal if I write the right code. If you want your code to consider ` ` and `&nbsp;` to be equal, then write the code for that rather than trying to reverse-parse some HTML via regex. –  Nov 16 '15 at 23:30

To get the HTML of the meta tag as is, use outerHTML:

document.getElementsByTagName('meta')[4].outerHTML

Working Snippet:

console.log(document.getElementsByTagName('meta')[0].outerHTML);
<meta property="analytics-track" content="Hey&nbsp;There!">
<h3>Check your console</h3>

Element.outerHTML - Web APIs | MDN


Update 1:

To filter out the meta content, use the following:

metaInfo.match(/content="(.*)">/)[1];  // assuming that content attribute is always at the end of the meta tag

Working Snippet:

var metaInfo = document.getElementsByTagName('meta')[0].outerHTML;

console.log(metaInfo);

console.log('Meta Content = ' + metaInfo.match(/content="(.*)">/)[1]);
<meta property="analytics-track" content="Hey&nbsp;There!">
<h3>Check your console</h3>
Rahul Desai
  • Would recommend against using regular expressions to "parse" html. [Further reading](http://htmlparsing.com/regexes.html) – JCOC611 Nov 16 '15 at 23:02
  • @JCOC611 I understand that but then the encoded text is hard to get IMHO. *EDIT:* If the `content` attribute is always put at the end of the meta tag, then my solution will work. – Rahul Desai Nov 16 '15 at 23:16