Pure Regex solution for getting text content from a string of HTML in an environment where I cannot rely on document.createElement?

Question

I have strings of HTML and I want to get the text content of the elements, but the environment I'm working in doesn't allow me to create an element and then simply get innerText like:

const span = document.createElement('span');
span.innerHTML = myHtmlString;
const justTheText = span.innerText;

Is it possible to do this with only Regex? I've given it a number of attempts, but never come up with a working solution. The nested nature of the tags leads to me getting 90% working solutions, but I can't find any way to handle that aspect. (Apologies for not having an example of one of my attempts, I'm just revisiting this issue after abandoning it months ago after spending multiple days on it.)

I've also never found a workaround, regex or not, as 99.999% of the time the right answer is to use the code I posted above, and that's exactly the answer that's given.

(I'd also be open to non-regex solutions)

Edit:

Example of HTML String:

<div>
  <p class="someclass">
      Some plain text 
        <strong>
          and some bold
        </strong>
  </p>
</div>

Getting the text from a single HTML element via regex is easy, but I'm not sure there's any way to handle the nesting to get the result: Some plain text and some bold. If there is a way, I'm not aware of it, but some of the most advanced features of regex are still beyond my understanding.

https://stackoverflow.com/questions/38343951/how-do-i-parse-an-html-file-in-react-native — VLAZ, Sep 15 '19 at 21:26
All you have to do is give an example of what you're trying to match and what you're not. I'm 100% sure you will not get a regex answer without doing that. — , Sep 15 '19 at 22:00
I'm pretty sure I got my hopes up for react-native-html-parser, but since it emulates the functionality rather than ports it, if I recall you cannot actually get `innerText` or any sort of equivalent. @sln I'll update my post to include a sample HTML string. — Slbox, Sep 15 '19 at 22:13
“Node tools” that would parse HTML are often just pure JavaScript – give them (especially the ones that also claim browser compatibility) a second look. — Ry-, Sep 15 '19 at 22:23
Parsing HTML with regex is [fraught with danger](https://stackoverflow.com/a/1732454/4665) — Jon P, Sep 15 '19 at 22:41
@JonP - There is nothing that can't be parsed with regex, even binary. — , Sep 17 '19 at 20:37

score 2 · Answer 1 · 2019-09-17T20:48:45.743

You could always get the content of a tag.
From the content, remove the inner tags, then trim the whitespace.

In the example we're using the div tag, but you could also use
any tag with attributes, like the p tag below.

Here is a JS example:

var tag = "div";  
// var tag = "p";   // <= try this; works with tags with attributes as well

var rxTagContent = new RegExp( "<" + tag + "(?:\\s*>|\\s+(?=((?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+))\\1>)((?:(?=(<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\4\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>|[\\S\\s]))\\3)*?)</" + tag + "\\s*>", "g" );

var rxRmvInnerTags = 
/<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/g;

var rxWspTrim = /\s+/g;

////////////////////////////////////////////////
//
var html = 
"<div>\n" +
"  <p class=\"someclass\">\n" +
"      Some plain text \n" +
"        <strong>\n" +
"          and some bold\n" +
"        </strong>\n" +
"  </p>\n" +
"</div>\n";

var match;

while ( ( match = rxTagContent.exec( html ) ) !== null )
{
  var cont  = match[2];                               // group 2 is the tag's content
  var clean = cont.replace( rxRmvInnerTags, "" );     // strip the inner tags
  var trim  = clean.replace( rxWspTrim, " " ).trim(); // collapse whitespace runs, then trim both ends

  console.log( "content = " + cont );
  console.log( "clean and trim = \n" + trim );
}

The expanded, readable version of the constructed Tag Content regex is at the end of this answer.

Note that this regex and the one that removes the inner tags are
fairly sophisticated. Should you need specific information on
how they work, just let me know. I usually show up every few days,
sometimes every week or two, depending on how many of my comments are
being deleted by administrator whoever ...

Update: Modified regex to avoid matching the closing tag text
if it happens to be inside a CDATA or even if it's part of another
tag's value, or even if it's in invisible content like a script.

For example, the sample <tag> markup shown further down will match correctly.

Note the only thing missing is the ability to handle the tag nested inside itself.
JavaScript's regex engine has no recursion or balancing constructs, so a
single pattern cannot do that. Regex can still be used to find tags and
content a piece at a time for a fully custom parse.
But that's a different story.
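
As a rough illustration of that piece-at-a-time idea (this is not the regex from this answer; the function name stripToText and its simplified token pattern are assumptions made for this sketch), each match below is one token: a comment, a CDATA section, a whole script/style element, some other tag, or a run of text. Only the text runs are kept.

```javascript
// Hypothetical helper, simplified on purpose: it does not decode entities
// and does not validate that tags are balanced.
function stripToText(html) {
  // Alternatives, tried in this order at each position:
  //   comment | CDATA | whole <script>/<style> element | <!...> or <?...> |
  //   any other open/close tag | run of text | stray "<"
  var rxToken = /<!--[\s\S]*?-->|<!\[CDATA\[[\s\S]*?\]\]>|<(script|style)\b(?:"[^"]*"|'[^']*'|[^>"'])*>[\s\S]*?<\/\1\s*>|<[!?][^>]*>|<\/?[A-Za-z](?:"[^"]*"|'[^']*'|[^>"'])*>|[^<]+|</gi;
  var parts = [];
  var m;

  while ((m = rxToken.exec(html)) !== null) {
    var tok = m[0];
    // Keep text runs and a lone "<"; drop every token that is markup.
    if (tok.charAt(0) !== "<" || tok === "<") {
      parts.push(tok);
    }
  }
  // Same whitespace handling as above: collapse runs, then trim the ends.
  return parts.join("").replace(/\s+/g, " ").trim();
}

var sample =
  "<div>\n" +
  "  <p class=\"someclass\">\n" +
  "      Some plain text \n" +
  "        <strong>\n" +
  "          and some bold\n" +
  "        </strong>\n" +
  "  </p>\n" +
  "</div>\n";

console.log(stripToText(sample)); // Some plain text and some bold
```

Because the scan never tries to pair open and close tags, nesting depth does not matter here. The trade-off is that it returns the text of the whole string rather than of one chosen element.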

This regex, though, pairs the first open tag with the first close tag that follows it.
It can be taken one step further to match only an un-nested
open / close pair if needed; that just takes one added assertion.
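
One plausible reading of that "added assertion" is a negative lookahead that refuses to step over another open tag of the same name, so only the innermost pair can match. The sketch below uses a deliberately simplified pattern rather than the full regex from this answer; innermostContent is a hypothetical name, and the tag argument is assumed to be a plain tag name with no regex metacharacters.

```javascript
// Returns the raw content of every <tag>...</tag> pair that does not
// contain another <tag ...> opening inside it.
function innermostContent(html, tag) {
  var rx = new RegExp(
    "<" + tag + "\\b[^>]*>" +                  // open tag (simple attribute handling)
    "((?:(?!<" + tag + "\\b)[\\s\\S])*?)" +    // content: any char not starting another open tag
    "</" + tag + "\\s*>",                      // close tag
    "g"
  );
  var found = [];
  var m;
  while ((m = rx.exec(html)) !== null) {
    found.push(m[1]);
  }
  return found;
}

console.log(innermostContent("<div>a <div>b</div> c</div>", "div")); // [ 'b' ]
```

The outer div is skipped because its content would have to cross the inner open tag, which the lookahead forbids.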

Also note that this doesn't prevent matching the open tag
if it happens to be inside a CDATA section or the other constructs mentioned above.
This can be avoided, but it requires expanding the tag regex and adding a check within the while() loop to skip past them.
Let me know if you need this (or I may just add it in a
day or so; I don't want it to get too out of control). It is possible, though.

<tag> 

   Some content
   more
   and more

   <script>
      var xyz;
      var tag = "</tag>";
   </script>

   <![CDATA[ </tag> asdfasdf]]>

</tag>

https://regex101.com/r/Bs4ySe/1

 <tag
 (?:
      \s* >
   |  \s+ 
      (?=
           (                             # (1 start)
                (?:
                     " [\S\s]*? "
                  |  ' [\S\s]*? '
                  |  (?:
                          (?! /> )
                          [^>] 
                     )?
                )+
           )                             # (1 end)
      )
      \1 >
 )
 (                             # (2 start)
      (?:
           (?=
               (                        # (3 start)

                     <(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\4\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
                  |  [\S\s]
               )                        # (3 end)
           )
           \3 
      )*?
 )                             # (2 end)
 </tag \s* >

This looks really promising! As soon as I can loop back to this task I'll update and accept if it solves the issue. I don't need to worry about CDATA but I appreciate the robustness! — Slbox, Sep 17 '19 at 21:55

score 0 · Answer 2 · answered Sep 22 '19 at 00:03

The regex example above is very good. Creating groups with () is the key, because then you can pick out the text by itself. I would try a slightly simpler approach that uses recursion to deal with the nesting.
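
A minimal sketch of what recursion could look like here, under some loud assumptions: the markup is well-formed, comments, CDATA and script content are not handled, and the names parseHtml, textOf and textByTag are made up for this example. A small regex yields one token at a time, and a function calls itself whenever it meets an opening tag, so each level of nesting gets its own call.

```javascript
function parseHtml(html) {
  // One token per match: closing tag (group 1), opening tag (groups 2 and 3),
  // a run of text, or a stray "<".
  var rxToken = /<\/([A-Za-z][\w:-]*)\s*>|<([A-Za-z][\w:-]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>|[^<]+|</g;
  var rxVoid = /^(?:area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)$/i;

  // Reads nodes until the closing tag of the current element (or end of input).
  function parseChildren() {
    var nodes = [];
    var m;
    while ((m = rxToken.exec(html)) !== null) {
      if (m[1] !== undefined) {
        return nodes;                               // closing tag ends this level
      }
      if (m[2] !== undefined) {
        var isLeaf = rxVoid.test(m[2]) || /\/\s*$/.test(m[3]);
        nodes.push({
          name: m[2].toLowerCase(),
          children: isLeaf ? [] : parseChildren()   // recursion handles the nesting
        });
      } else {
        nodes.push(m[0]);                           // text run
      }
    }
    return nodes;
  }
  return parseChildren();
}

// Concatenates all text below a node, however deeply it is nested.
function textOf(node) {
  if (typeof node === "string") { return node; }
  return node.children.map(textOf).join("");
}

// Collects the whitespace-normalized text of every element called `name`,
// outer elements before the ones nested inside them.
function textByTag(html, name) {
  var found = [];
  (function walk(nodes) {
    nodes.forEach(function (node) {
      if (typeof node === "string") { return; }
      if (node.name === name) {
        found.push(textOf(node).replace(/\s+/g, " ").trim());
      }
      walk(node.children);
    });
  })(parseHtml(html));
  return found;
}

console.log(textByTag("<div>a <div>b</div> c</div>", "div")); // [ 'a b c', 'b' ]
```

Unlike a single first-open/first-close match, the outer div here keeps the text that follows the inner closing tag.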

An alternative approach is to use the npm package "cheerio". It is commonly used in web scraping, but you can feed it any HTML. Methods similar to jQuery's can then be used to traverse the HTML and pick out the content.
