Search the document body for {~ contents ~}

Question

Alright, so basically I would like to search the contents of the body tag for {~, grab whatever follows it up to the next ~}, and turn that into a string (not including the {~ or ~}).

Where is the `{~`? Do you want to look through the HTML source, or what? — CertainPerformance, Apr 03 '18 at 01:54

CertainPerformance · Answer 1 · 2018-04-03T02:30:59.337

2

// Lazy (.+?) stops at the first closing ~} instead of running to the last one.
const match = document.body.innerHTML.match(/\{~(.+?)~\}/);
if (match) console.log(match[1]);
else console.log('No match found');

<body>text {~inner~} text </body>

edited Apr 03 '18 at 02:30

answered Apr 03 '18 at 02:01


With this, you only get one match, even if the HTML contains more than one. – Wils Apr 03 '18 at 02:28
Edited into a snippet which runs fine, don't know why it wouldn't work for you – CertainPerformance Apr 03 '18 at 02:31
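
For the multiple-match case raised in the comment above, one possible sketch is to add the g flag and iterate with String.prototype.matchAll (ES2020). The helper name extractTags is made up for illustration; it works on any string, so in a page you would pass it document.body.innerHTML.

```javascript
// Hypothetical helper: returns the text inside every {~ ... ~} pair in a string.
function extractTags(text) {
  // g flag: find all matches; (.+?) is lazy so each pair is matched separately.
  const rx = /\{~(.+?)~\}/g;
  // matchAll yields one match array per hit; index 1 is the capture group.
  return Array.from(text.matchAll(rx), m => m[1]);
}

const sample = '<p>{~Content 1~}</p> text {~Content 2~}';
console.log(extractTags(sample)); // [ 'Content 1', 'Content 2' ]
```

When nothing matches, the function returns an empty array, so no null check is needed.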

Wils · Answer 2 · 2018-04-03T05:51:02.490

2

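$(function () {
  var bodyText = document.getElementsByTagName("body")[0].innerHTML;

  // match() with the g flag returns every full match, or null if there are none
  var found = bodyText.match(/{~(.*?)~}/gi) || [];

  $.each(found, function (index, value) {
    // each full match starts with {~ and ends with ~}; cut off both delimiters
    var ret = value.slice(2, -2);
    console.log(ret);
  });
});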

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
   <body> {~Content 1~}

{~Content 2~}
</body>

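There you go: put gi at the end of the regex. The g flag makes match() return every match instead of just the first one; the i flag only makes it case-insensitive.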

edited Apr 03 '18 at 05:51

answered Apr 03 '18 at 02:18


You don't need to install a heavyweight library like jQuery just for iteration - also, if you're just selecting a single element, better to use `querySelector` than to use one of the methods that returns a collection and then select the first element in the collection. – CertainPerformance Apr 03 '18 at 02:27

Did he tag the post jQuery? I guess – Wils Apr 03 '18 at 02:29
If the text between the `{~` and `~}` contains a space, such as `{~Content 1~}`, your regex will fail... `\w` only matches ASCII letters, digits and the underscore, so any Unicode characters outside the ASCII range would cause it to fail too. – Useless Code Apr 03 '18 at 05:05
@UselessCode fixed. – Wils Apr 03 '18 at 05:51
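
To see the difference described in these comments, the sketch below runs a \w-based pattern and the lazy .*? pattern over the same sample string. The earlier \w regex is not shown in the thread, so /{~(\w+)~}/g is only an assumed reconstruction.

```javascript
// Assumed reconstruction of a \w-based pattern (the original is not shown).
const wordOnly = /{~(\w+)~}/g;
// The lazy pattern from the edited answer.
const lazyAny = /{~(.*?)~}/g;

const samples = '{~Content 1~} {~héllo~} {~plain~}';

// \w matches only A-Z, a-z, 0-9 and _, so the space and the é break the match.
const wordOnlyHits = samples.match(wordOnly) || [];
const lazyAnyHits = samples.match(lazyAny) || [];

console.log(wordOnlyHits); // [ '{~plain~}' ]
console.log(lazyAnyHits);  // [ '{~Content 1~}', '{~héllo~}', '{~plain~}' ]
```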

Useless Code · Answer 3 · 2018-04-03T05:16:07.070

This is a harder problem than it first appears: script tags and comments can throw a wrench into things if you just grab the innerHTML of the body. The following function takes a base element to search (in your case, pass in document.body) and returns an array of the strings it finds.

function getMyTags (baseElement) {
  const rxFindTags = /{~(.*?)~}/g;

  // .childNodes contains not only elements, but any text that
  // is not inside of an element, comments as their own node, etc.
  // We will need to filter out everything that isn't a text node
  // or a non-script tag.
  let nodes = baseElement.childNodes;
  let matches = [];
  
  nodes.forEach(node => {
    const nodeType = node.nodeType;
    // if this is a text node, or an element that is not a script tag
    if (nodeType === 3 || (nodeType === 1 && node.nodeName !== 'SCRIPT')) {
      let html;
      if (nodeType === 3) { // text node
        html = node.nodeValue;
      } else { // element
        html = node.innerHTML; // or .innerText if you don't want the HTML
      }

      let match;
      // search the html for matches until it can't find any more
      while ((match = rxFindTags.exec(html)) !== null) {
        // the [1] is to get the first capture group, which contains
        // the text we want
        matches.push(match[1]);
      }
    }
  });

  return matches;

}

console.log('All the matches in the body:', getMyTags(document.body));
console.log('Just in header:', getMyTags(document.getElementById('title')));

<h1 id="title"><b>{~Foo~}</b>{~bar~}</h1>
Some text that is {~not inside of an element~}
<!-- This {~comment~} should not be captured -->
<script>
 // this {~script~} should not be captured
</script>
<p>Something {~after~} the stuff that shouldn't be captured</p>

The regular expression /{~(.*?)~}/g works like this:

{~ starts the match at a literal {~.
(.*?) captures anything after it; the ? makes it "non-greedy" (also known as "lazy"), so if there are two instances of {~something~} in any of the strings we are searching, it captures each one individually instead of capturing from the first {~ to the last ~} in the string.
~} says there has to be a literal ~} after the captured text.

The g option makes it a 'global' search, meaning it will look for all matches in the string, not just the first one.
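
A small sketch of the greedy versus lazy behaviour described above, run on a made-up sample string. It also shows one thing the explanation does not mention: . does not match line breaks, so a tag that spans several lines needs something like [\s\S] instead.

```javascript
const text = 'a {~one~} b {~two~} c';

// Greedy: .* runs as far as it can, up to the last ~} in the string.
const greedy = text.match(/{~(.*)~}/)[1];
// Lazy: .*? stops at the first ~} it reaches.
const lazy = text.match(/{~(.*?)~}/)[1];

console.log(greedy); // one~} b {~two
console.log(lazy);   // one

// . does not match line breaks; [\s\S] matches any character at all.
const multiLine = '{~line 1\nline 2~}';
const dotResult = multiLine.match(/{~(.*?)~}/);      // null, no match
const anyResult = multiLine.match(/{~([\s\S]*?)~}/); // matches across the break
```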

Further reading

childNodes
nodeType
Regular-Expressions.info has a great regular expression tutorial.
MDN RegExp documentation

Tools

There are lots of different tools out there to help you develop regular expressions. Here are a couple I've used:

RegExr has a great tool that explains how a particular regular expression works.
RegExPal

The one downside to this approach, compared with just grabbing the `.innerHTML` of the body, is that it won't capture something that has the `{~` and `~}` spread across different nodes or elements. – Useless Code, Apr 03 '18 at 05:10
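
If tags split across nodes matter, one possible workaround (a sketch, not part of the answer above) is to walk the tree recursively, join the text of all text nodes and only then run the regex. The names collectText and getTagsAcrossNodes are hypothetical. Because only text nodes are collected, comments and script/style contents are skipped at any depth, but markup inside a tag is dropped, so {~<b>Foo</b>~} would yield Foo.

```javascript
// Hypothetical helper: gather the text of every text node under root, in
// document order, skipping comments and the contents of <script>/<style>.
function collectText(root, parts = []) {
  for (const child of Array.from(root.childNodes)) {
    if (child.nodeType === 3) {            // text node
      parts.push(child.nodeValue);
    } else if (child.nodeType === 1 &&     // element
               child.nodeName !== 'SCRIPT' && child.nodeName !== 'STYLE') {
      collectText(child, parts);           // recurse into nested elements
    }
  }
  return parts;
}

// Hypothetical helper: search the joined text, so a tag may start in one
// node and end in another.
function getTagsAcrossNodes(root) {
  const text = collectText(root).join('');
  // [\s\S]*? is lazy and, unlike ., also matches line breaks
  return Array.from(text.matchAll(/{~([\s\S]*?)~}/g), m => m[1]);
}

// In a browser: console.log(getTagsAcrossNodes(document.body));
```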
