regex exclude certain tag

Question

I'm cleaning up the output of a WYSIWYG editor. Instead of inserting a line break, it creates an empty p tag, but it sometimes also creates other empty tags that aren't needed.

I have a regex that removes all empty tags, but I want to exclude empty p tags from it. How do I do that?

let s = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";

s = s.trim().replace( /<(\w*)\s*[^\/>]*>\s*<\/\1>/g, '' )

console.log(s)

Hi there - a bit of Stack Overflow tradition is to share this link with anyone attempting to use regex to match HTML content: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - Lix, May 16 '18 at 08:09
I'd suggest using a dedicated HTML parsing library for this task. There are so many edge cases to handle that a regex will get very complex and hard to manage. - Lix, May 16 '18 at 08:10
Your case doesn't really fall under the exceptions where regex could be suitable for HTML, I fear. If using a parser bothers you, you can still inject the markup into a div and filter the content using the DOM. - Kaddath, May 16 '18 at 08:11

Mamun · Answer 1 · 2018-05-16T08:23:44.427

1

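Add (?!p\b) right after the opening < in your regex. This is called a negative lookahead: it makes the match fail when the tag name is exactly p. The \b (word boundary) matters, because a bare (?!p) would also skip every tag whose name merely starts with p, such as pre:

let s = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";

s = s.trim().replace( /<(?!p\b)(\w*)\s*[^\/>]*>\s*<\/\1>/g, '' )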

console.log(s)
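
The lookahead idea can be generalized to more than one protected tag. The following is only a sketch: stripEmptyTags and its keep parameter are hypothetical names, and it assumes keep contains plain tag names with no regex metacharacters. It puts a word boundary (\b) after the alternation so that only exact tag names are protected, which means pre is still removed when only p is kept.

```javascript
// Hypothetical helper (not part of the original answer): removes empty
// elements from an HTML string, except for the tag names listed in `keep`.
// Assumes `keep` holds plain tag names without regex metacharacters.
function stripEmptyTags(html, keep = ['p']) {
  // Negative lookahead that rejects only exact matches of a kept tag name.
  const guard = keep.length ? `(?!(?:${keep.join('|')})\\b)` : '';
  // Same shape as the regex from the question, with the guard prepended.
  const pattern = new RegExp(`<${guard}(\\w+)\\s*[^\\/>]*>\\s*<\\/\\1>`, 'gi');
  return html.replace(pattern, '');
}

console.log(stripEmptyTags('<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>'));
// <h1>test</h1><p>a</p><p></p>
console.log(stripEmptyTags('<pre></pre><p></p>'));
// <p></p>
```

Like the regex in the question, this makes a single pass, so it inherits the usual limits of matching HTML with a regex.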

edited May 16 '18 at 08:23

answered May 16 '18 at 08:10


score 1 · Answer 2 · answered May 16 '18 at 08:18

I understand that you want to use regex for that, but there are better ways. Consider using DOMParser:

var x = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";
var parser = new DOMParser();
var doc = parser.parseFromString(x, "text/html");
// Remove every element that has no child nodes, except p elements.
Array.from(doc.body.querySelectorAll("*"))
    .filter((d) => !d.hasChildNodes() && d.tagName.toUpperCase() !== "P")
    .forEach((d) => d.parentNode.removeChild(d));
console.log(doc.body.innerHTML);
// "<h1>test</h1><p>a</p><p></p>"

You can wrap the above in a function and modify as you like.
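
One possible way to wrap it is sketched below. The function name removeEmptyElements and the keepTags parameter are hypothetical, and the code assumes a browser environment, since DOMParser is not built into Node.js. Void elements such as br or img also have no child nodes, so they would need to be added to keepTags if they should survive.

```javascript
// Hypothetical wrapper around the DOMParser approach (browser-only sketch).
// `keepTags` lists tag names that are kept even when they are empty.
function removeEmptyElements(html, keepTags = ['P']) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // tagName is uppercase in HTML documents, so normalize the list once.
  const keep = new Set(keepTags.map((t) => t.toUpperCase()));
  doc.body.querySelectorAll('*').forEach((el) => {
    // An element with no child nodes at all (no text, no elements) is empty.
    if (!el.hasChildNodes() && !keep.has(el.tagName)) {
      el.remove();
    }
  });
  return doc.body.innerHTML;
}
```

Calling removeEmptyElements(x) with the string from the snippet above should give the same result as that snippet, and passing ['P', 'BR', 'IMG'] would additionally preserve line breaks and images.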

That is a great answer. Is there a more efficient way with jQuery or ES6? Thanks - totalnoob, May 16 '18 at 08:22

Matus Dubrava · Accepted Answer · 2018-05-16T08:32:47.010

1

You can use DOMParser to be on the safe side.

let s = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";

const parser = new DOMParser();
const doc = parser.parseFromString(s, 'text/html');
const elems = doc.body.querySelectorAll('*');

[...elems].forEach(el => {
  if (el.textContent === '' && el.tagName !== 'P') {
    el.remove();
  }
});

console.log(doc.body.innerHTML);

edited May 16 '18 at 08:32

answered May 16 '18 at 08:21

