I'm cleaning the output of a WYSIWYG editor that creates an empty p tag instead of inserting a line break, but it sometimes also creates other empty tags that aren't needed.

I have a regex that removes all empty tags, but I want to exclude empty p tags from it. How do I do that?

let s = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";

s = s.trim().replace( /<(\w*)\s*[^\/>]*>\s*<\/\1>/g, '' )

console.log(s)
totalnoob
  • Hi there - a bit of Stack Overflow tradition is to share this link with anyone attempting to use regex to match HTML content - https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Lix May 16 '18 at 08:09
  • I'd suggest using a dedicated HTML parsing library to perform this task since there are so many edge cases that you would need to handle - it will get very complex and hard to manage. – Lix May 16 '18 at 08:10
  • HTML vs regex - everlasting war :) – Matus Dubrava May 16 '18 at 08:10
  • Your case doesn't really fall among the exceptions where regex could be suitable for HTML, I fear. You can still inject it into a div and filter the content using the DOM, if using a parser bothers you – Kaddath May 16 '18 at 08:11

3 Answers

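Add a negative lookahead, (?!p\b), right after the opening <. The \b limits the exclusion to p itself, so other tags whose names start with p (such as pre) are still removed:

let s = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";

s = s.trim().replace( /<(?!p\b)(\w*)\s*[^\/>]*>\s*<\/\1>/g, '' )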

console.log(s)
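
If more than one tag should survive, or empty tags can be nested inside each other, the lookahead idea can be generalized. The following is only a sketch under assumptions: stripEmptyTags and its keep parameter are illustrative names, not part of the answer, and the pattern is still a regex over HTML, so it shares the limits mentioned in the comments on the question.

```javascript
// Hypothetical helper: the function name and the `keep` list are illustrative.
function stripEmptyTags(html, keep = ['p']) {
  // Negative lookahead over the tag names to preserve. The \b stops "p"
  // from also protecting longer names such as "pre".
  const guard = keep.length ? `(?!(?:${keep.join('|')})\\b)` : '';
  const emptyTag = new RegExp(`<${guard}(\\w+)\\b[^>]*>\\s*</\\1>`, 'gi');

  // Repeat until nothing changes, so wrappers emptied by an earlier pass
  // (for example <div><span></span></div>) are removed as well.
  let previous;
  do {
    previous = html;
    html = html.replace(emptyTag, '');
  } while (html !== previous);
  return html;
}

console.log(stripEmptyTags('<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>'));
// "<h1>test</h1><p>a</p><p></p>"
```

Passing a longer list, for example stripEmptyTags(html, ['p', 'td']), keeps empty table cells too.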
Mamun

I understand that you want to use regex for that, but there are better ways. Consider using DOMParser:

var x = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";
var parser = new DOMParser();
var doc = parser.parseFromString(x, "text/html");
// Remove every element that has no child nodes, except <p>.
Array.from(doc.body.querySelectorAll("*"))
    .filter((d) => !d.hasChildNodes() && d.tagName.toUpperCase() !== "P")
    .forEach((d) => d.parentNode.removeChild(d));
console.log(doc.body.innerHTML);
// "<h1>test</h1><p>a</p><p></p>"

You can wrap the above in a function and modify as you like.
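
One possible way to wrap it, as a rough sketch: removeEmptyElements and its keep parameter are made-up names for illustration. It walks the elements in reverse document order, which is an addition to the snippet above, so that a wrapper holding only empty elements is removed too. DOMParser is a browser API, hence the guard around the demo call.

```javascript
// Hypothetical helper: the function name and the `keep` list are illustrative.
function removeEmptyElements(html, keep = ['p']) {
  // tagName is uppercase for HTML elements, so normalize the keep list.
  const keepTags = new Set(keep.map((tag) => tag.toUpperCase()));
  const doc = new DOMParser().parseFromString(html, 'text/html');

  // querySelectorAll returns document order (parents first); reversing it
  // visits children before their parents.
  const elements = Array.from(doc.body.querySelectorAll('*')).reverse();
  for (const el of elements) {
    if (!el.hasChildNodes() && !keepTags.has(el.tagName)) {
      el.remove();
    }
  }
  return doc.body.innerHTML;
}

// Only run the demo where DOMParser exists (for example, in a browser).
if (typeof DOMParser !== 'undefined') {
  console.log(removeEmptyElements('<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>'));
  // "<h1>test</h1><p>a</p><p></p>"
}
```

Void elements such as br or img never have child nodes, so add them to keep if they should survive the cleanup.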

ibrahim tanyalcin

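You can use DOMParser to be on the safe side. Note that textContent is also empty for an element that contains only empty elements, so a wrapper around an empty p is removed together with it.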

let s = "<h1>test</h1><h1></h1><p>a</p><p></p><h2></h2>";

const parser = new DOMParser();
const doc = parser.parseFromString(s, 'text/html');
const elems = doc.body.querySelectorAll('*');

[...elems].forEach(el => {
  if (el.textContent === '' && el.tagName !== 'P') {
    el.remove();
  }
});

console.log(doc.body.innerHTML);
Matus Dubrava