Regex - Get all text between > and <, however don't get text if its between >

Question

I am having a problem trying to figure out this regex code. Let me show the code and then kinda explain from that.

<div class="test">
    Nisl rhoncus mattis rhoncus urna neque viverra. Senectus et netus et
    malesuada.
    <button class="button">Feugiat nisl pretium fusce id velit.</button>
    Turpis nunc eget lorem dolor sed viverra ipsum nunc. Gravida dictum
    fusce ut placerat. Viverra maecenas accumsan lacus vel facilisis.
 </div>

So I am trying to have the regex get between the > and < however ignore <button class="button">Feugiat nisl pretium fusce id velit.</button>

So the regex should parse this and return only these lines:

Nisl rhoncus mattis rhoncus urna neque viverra. Senectus et netus et malesuada.

Turpis nunc eget lorem dolor sed viverra ipsum nunc. Gravida dictum fusce ut placerat. Viverra maecenas accumsan lacus vel facilisis.

Also because it will be for parsing HTML I need it to ignore completely empty spaces. What I mean is that

   </div>
</div>

Technically the space between is valid but I don't want to capture spaces unless there is text included and not only spaces. Imagine getting paragraphs between HTML. The idea is taking all text between divs and then wrapping it with a <p> however certain cases like buttons shouldn't have a <p> inside.

I'm not sure if this is even possible. I hope this makes sense and any help would be appreciated!

Edit:

Using NodeJS

I am trying to convert MD files (GitHub markdown files) to HTML, with an intermediate step for custom syntax like this:

[flipcard]
[front color:blue]
Morbi tristique senectus et netus et malesuada. Interdum consectetur libero id faucibus nisl tincidunt. Purus faucibus ornare suspendisse sed nisi. Laoreet id donec ultrices tincidunt arcu. Elementum pulvinar etiam non quam lacus suspendisse faucibus.
[/front]
[back]
Nisl rhoncus mattis rhoncus urna neque viverra. Senectus et netus et malesuada.
[button]Feugiat nisl pretium fusce id velit.[/button] Turpis nunc eget lorem dolor sed viverra ipsum nunc. Gravida dictum fusce ut placerat. Viverra maecenas accumsan lacus vel facilisis. Nascetur ridiculus mus mauris vitae ultricies leo integer. Pellentesque pulvinar pellentesque habitant morbi tristique senectus et netus. Velit laoreet id donec ultrices.[/back] 
[/flipcard]

It's almost like custom HTML inside markdown files. My custom parser converts this syntax first, and once it is done I run the result through the NodeJS Marked parser to catch the remaining markdown elements. Then I get this:

 <div class="flipcard">
      <div class="front" style="color: blue">
        Morbi tristique senectus et netus et malesuada. Interdum consectetur
        libero id faucibus nisl tincidunt. Purus faucibus ornare suspendisse sed
        nisi. Laoreet id donec ultrices tincidunt arcu. Elementum pulvinar etiam
        non quam lacus suspendisse faucibus.
      </div>
      <div class="back">
        Nisl rhoncus mattis rhoncus urna neque viverra. Senectus et netus et
        malesuada.
        <button class="button">Feugiat nisl pretium fusce id velit.</button>
        Turpis nunc eget lorem dolor sed viverra ipsum nunc. Gravida dictum
        fusce ut placerat. Viverra maecenas accumsan lacus vel facilisis.
        Nascetur ridiculus mus mauris vitae ultricies leo integer. Pellentesque
        pulvinar pellentesque habitant morbi tristique senectus et netus. Velit
        laoreet id donec ultrices.
      </div>
    </div>

This is very close to what I need but I need the final output to be like this:

<div class="flipcard">
      <div class="front" style="color: blue">
        <p>
          Morbi tristique senectus et netus et malesuada. Interdum consectetur
          libero id faucibus nisl tincidunt. Purus faucibus ornare suspendisse
          sed nisi. Laoreet id donec ultrices tincidunt arcu. Elementum pulvinar
          etiam non quam lacus suspendisse faucibus.
        </p>
      </div>
      <div class="back">
        <p>
          Nisl rhoncus mattis rhoncus urna neque viverra. Senectus et netus et
          malesuada.
        </p>
        <button class="button">Feugiat nisl pretium fusce id velit.</button>
        <p>
          Turpis nunc eget lorem dolor sed viverra ipsum nunc. Gravida dictum
          fusce ut placerat. Viverra maecenas accumsan lacus vel facilisis.
          Nascetur ridiculus mus mauris vitae ultricies leo integer.
          Pellentesque pulvinar pellentesque habitant morbi tristique senectus
          et netus. Velit laoreet id donec ultrices.
        </p>
      </div>
    </div>

Don't parse HTML with regex. Regex is fundamentally unable to parse HTML. Use an HTML parser. — Tomalak, Nov 27 '20 at 07:30
Do you have any recommendations for HTML parsers? I tried html-parser for NodeJS and it didn't work. The reason I am doing it like this is that I have custom content like [front] [/front] inside of an MD file. So I run my custom parser to get all of those elements, then an MD parser to convert to HTML, but the p tags aren't quite right and need to be removed and re-added in specific locations like above. — Zodsmar, Nov 27 '20 at 07:35
This is all valuable info that needs to be in the question: That you're using node, that you already tried a parser, which one, what code you tried, what "didn't work" means, what "MD" files are, and example of those files, an example of the required output. — Tomalak, Nov 27 '20 at 07:39
Thanks. I updated it with all the information and what the final output needs to be and why I am trying to achieve this this way. — Zodsmar, Nov 27 '20 at 07:50
Okay so you have almost everything solved using parsers, all you need is wrap lines inside the text of `div.flipcard > d` in `<p>`? — Tomalak, Nov 27 '20 at 07:54
Does this answer your question? [Parse an HTML string with JS](https://stackoverflow.com/questions/10585029/parse-an-html-string-with-js) — Liam, Nov 27 '20 at 08:35

Tomalak · Accepted Answer · 2020-11-27T08:39:50.303

From what I understand, you have HTML generated from a parser like this:

var htmlFragment = `<div class="flipcard">
  <div class="front" style="color: blue">
    Morbi tristique senectus et netus et malesuada. Interdum consectetur
    libero id faucibus nisl tincidunt. Purus faucibus ornare suspendisse sed
    nisi. Laoreet id donec ultrices tincidunt arcu. Elementum pulvinar etiam
    non quam lacus suspendisse faucibus.
  </div>
  <div class="back">
    Nisl rhoncus mattis rhoncus urna neque viverra. Senectus et netus et
    malesuada.
    <button class="button">Feugiat nisl pretium fusce id velit.</button>
    Turpis nunc eget lorem dolor sed viverra ipsum nunc. Gravida dictum
    fusce ut placerat. Viverra maecenas accumsan lacus vel facilisis.
    Nascetur ridiculus mus mauris vitae ultricies leo integer. Pellentesque
    pulvinar pellentesque habitant morbi tristique senectus et netus. Velit
    laoreet id donec ultrices.
  </div>
</div>`;

and you want to wrap each "naked" text node child of div.flipcard > div in its own <p>.

If you're familiar with jQuery, this is an easy operation there. Select target nodes, call .wrap('<p>'). Cheerio is the jQuery equivalent in node, so if you want the same convenience, you can have it:

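const cheerio = require('cheerio');

// Third argument `false` parses the string as a fragment,
// so cheerio does not add <html>/<head>/<body> wrappers.
const $ = cheerio.load(htmlFragment, null, false);

$('div.flipcard > div').contents().filter(function () {
    return this.type === 'text';   // keep only the text node children
}).wrap('<p>');

console.log($.html());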

prints this:

<div class="flipcard">
  <div class="front" style="color: blue"><p>
    Morbi tristique senectus et netus et malesuada. Interdum consectetur
    libero id faucibus nisl tincidunt. Purus faucibus ornare suspendisse sed
    nisi. Laoreet id donec ultrices tincidunt arcu. Elementum pulvinar etiam
    non quam lacus suspendisse faucibus.
  </p></div>
  <div class="back"><p>
    Nisl rhoncus mattis rhoncus urna neque viverra. Senectus et netus et
    malesuada.
    </p><button class="button">Feugiat nisl pretium fusce id velit.</button><p>
    Turpis nunc eget lorem dolor sed viverra ipsum nunc. Gravida dictum
    fusce ut placerat. Viverra maecenas accumsan lacus vel facilisis.
    Nascetur ridiculus mus mauris vitae ultricies leo integer. Pellentesque
    pulvinar pellentesque habitant morbi tristique senectus et netus. Velit
    laoreet id donec ultrices.
  </p></div>
</div>

Of course you can do the same thing with a tiny bit more legwork using a regular DOM parser such as jsdom along the lines of this:

const { JSDOM } = require('jsdom');
const { document } = new JSDOM(htmlFragment).window;

document.querySelectorAll('div.flipcard > div').forEach(div => {
    for (let i = div.childNodes.length - 1; i >= 0; i--) {  // work from the end so we don't mess up the index
        const child = div.childNodes[i];
        if (child.nodeType === child.TEXT_NODE) {           // if we're at a text node
            const p = document.createElement('p');          // create `<p>`
            div.insertBefore(p, child.nextSibling);         // insert it right after the text node
            p.appendChild(child);                           // move the text node into the `<p>`
        }
    }
});

console.log(document.body.innerHTML);
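
If the wrapping should apply at any depth, skip whitespace-only text, and leave elements like `<button>` alone, the same idea can be written as a small recursive walk. The following is only a sketch on a hypothetical plain-object tree (the `type`/`name`/`data`/`children` shape, the `SKIP` set and the helper names are assumptions for illustration, not any library's API); with cheerio or jsdom you would walk their node objects instead.

```javascript
// Hypothetical minimal node shapes (illustration only, not a library API):
//   element: { type: 'element', name: 'div', children: [...] }
//   text:    { type: 'text', data: '...' }
const el = (name, children) => ({ type: 'element', name, children });
const text = (data) => ({ type: 'text', data });

// Elements whose text must stay unwrapped (assumed list, extend as needed).
const SKIP = new Set(['button', 'p']);

function wrapTextNodes(node) {
  if (node.type !== 'element' || SKIP.has(node.name)) return node;
  node.children = node.children.map(child => {
    if (child.type !== 'text') return wrapTextNodes(child); // recurse into elements
    if (!/\S/.test(child.data)) return child;               // whitespace only: leave as-is
    return el('p', [child]);                                // real text: wrap in <p>
  });
  return node;
}

// Tiny serializer for the demo; it ignores attributes and trims text.
function render(node) {
  if (node.type === 'text') return node.data.trim();
  return `<${node.name}>${node.children.map(render).join('')}</${node.name}>`;
}

const tree = el('div', [
  text('\n  '),
  el('div', [
    text('\n    Nisl rhoncus.\n    '),
    el('button', [text('Feugiat')]),
    text('\n    Turpis nunc.\n  '),
    text('\n   '),
  ]),
  text('\n'),
]);

const html = render(wrapTextNodes(tree));
console.log(html);
```

Because `p` is in the skip set, running the function twice does not wrap text a second time.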

I need to look more into this, but would something like this work for many different cases? Flipcard is just one case; what if there is another one called flipcard2, or the divs go 5 layers deep? My idea with the regex is that it would catch all of these cases. Or, for some reason, there is some text left over after the flipcard, like ``` left over text ```. People do dumb things when working with Markdown files, so I essentially need error catching that wraps any remaining text in ```<p>``` tags. — Zodsmar, Nov 27 '20 at 08:28
@Zodsmar Forget regex. Regex cannot deal with HTML, that's a reality. And yes, this can work with any situation you care to implement; with cheerio it would not even be difficult to write a pretty flexible solution. — Tomalak, Nov 27 '20 at 08:35
I will look into solving it that way. The only thing I am trying to understand is that text is text, whether I use >'s or ]'s. What's the difference between parsing [front] some text [/front] or ``` ```? My first part, parsing the flipcard and converting it to HTML, is all done via Regex. — Zodsmar, Nov 27 '20 at 08:46
@Zodsmar Well, it shouldn't be, but you can get away with it. Regex is bad at *anything* that has nesting; it's not about square brackets vs. angle brackets. *"Converting `[bla]` to `<bla>`"* is a much simpler operation than *"finding all text node children in this nested tree of HTML nodes, and transforming them to `<p>`"*. That's why you get away with regex in the first case, but not in the second case. — Tomalak, Nov 27 '20 at 08:55
...but if there is a parser for your input format (looks remotely like BBCode to me), you probably should use that parser instead of rolling your own regex-based parsing code. Of course if you wrote the code that converts `[front color:blue]` to `<div class="front" style="color: blue">`, then I don't see why that code is not also inserting the `<p>` at the right spots. — Tomalak, Nov 27 '20 at 08:57
Converting ```[front color:blue]``` to ```<div class="front" style="color: blue">``` is actually all done via Regex. It's not inserting ```<p>``` at the right spots because it's not aware of everything, only of what's between the tags. — Zodsmar, Nov 27 '20 at 09:04
I *know* that it's done via regex, and I explained why you get away with it two comments up. — Tomalak, Nov 27 '20 at 09:06
No, I get that. I probably need to rethink the way the entire thing is handled from the ground up anyway. — Zodsmar, Nov 27 '20 at 09:16
@Zodsmar Assuming that actual people write the `[flipcard]` markup, then converting that to rough HTML (via regex, as you currently do, no issues there), and throwing the result into an HTML parser is not a bad strategy. The HTML parser will sort out bad nesting and all the other things that can go wrong when people write markup code, and it allows you to fine-tune the result with ease (and without pulling your hair out because your regex slowly gets out of control and you just discovered that it *still* does not work right, and it never will). — Tomalak, Nov 27 '20 at 09:24
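
To make the distinction from the comments concrete, here is a rough sketch of both cases. The `convertTags` helper and its tag-to-class mapping are assumptions for illustration, not the asker's actual converter. Rewriting each `[tag]` on its own needs no knowledge of the surrounding structure, while matching a nested element does.

```javascript
// Case 1: flat, one-to-one token rewriting. Each tag is converted in isolation.
// Hypothetical mapping: [name color:x] -> <div class="name" style="color: x">
function convertTags(src) {
  return src
    .replace(/\[(\w+)(?:\s+color:(\w+))?\]/g, (_, name, color) =>
      color
        ? `<div class="${name}" style="color: ${color}">`
        : `<div class="${name}">`)
    .replace(/\[\/\w+\]/g, '</div>');
}

const flat = convertTags('[front color:blue]Morbi[/front]');
console.log(flat);

// Case 2: nested structure. A lazy match stops at the FIRST closing tag,
// so the captured "content" of the outer div is cut off mid-element.
const nested = '<div>outer <div>inner</div> tail</div>';
const lazy = nested.match(/<div>(.*?)<\/div>/)[1];
console.log(lazy);

// A greedy match runs to the LAST closing tag, which merges sibling elements.
const siblings = '<div>a</div><div>b</div>';
const greedy = siblings.match(/<div>(.*)<\/div>/)[1];
console.log(greedy);
```

Neither quantifier can count how many elements are currently open, which is the bookkeeping an HTML parser does for you.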
