Strip certain HTML from string

Question

I am using ngx-quill and the input body returns some HTML elements.

Example

<p><strong><em><u>"Soft fingers began to tap the sill of the car window, and the hard fingers tightened on the restless drawing sticks. In the doorways of the sun-beaten tenant houses, women sighed and then shifted feet so that the one that had been down was now on top, and the toes working. Dogs came sniffing near the owner cars and wetted on all four tires one after another. And chickens lay in the sunny dust and fluffed their feathers </u></em></strong></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><strong><em>to get the cleansing dust

I want to remove all of the HTML tags, except the newline paragraphs.

When a post has multiple lines / breaks, ngx-quill adds several chained `<p><br></p>` elements (see above).

I've tried to use the replace function to strip the elements, but certain elements like  are not being removed. Also, how can I consolidate the sections that have several line breaks into just one line break?

I have tried

post = '<p><strong><em><u>"Soft fingers began to tap the sill of the car window, and the hard fingers tightened on the restless drawing sticks. In the doorways of the sun-beaten tenant houses, women sighed and then shifted feet so that the one that had been down was now on top, and the toes working. Dogs came sniffing near the owner cars and wetted on all four tires one after another. And chickens lay in the sunny dust and fluffed their feathers </u></em></strong></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><strong><em>to get the cleansing dust down to the skin. In the little sties the pigs grunted inquiringly over the muddy remnants of the slops.""Soft fingers began to tap the sill of the car window, and the hard fingers tightened on the restless drawing sticks. In the doorways of the sun-beaten tenant houses, women sighed and then shifted feet so that the one that had been down was now on top, and the toes working. Dogs came sniffing near the owner cars and wetted on all four tires one after another. And chickens lay in the sunny dust and fluffed their feathers to get the cleansing dust down to the skin. In the little sties the pigs grunted inquiringly over the muddy remnants of the slops."</em></strong></p>'

function stripElements(post: any) {
    let newPost = post;
    newPost = newPost.replace('<u>', '<span>');
    newPost = newPost.replace('</u>', '</span>');
    newPost = post.replace('<strong>','');
    newPost = newPost.replace('</strong>', '');
    newPost = newPost.replace('<em>', '');
    newPost = newPost.replace('</em>', '');

    newPost = newPost.replace('<p><br></p>', '<p></p>')
    
    return newPost;
}

I would not recommend using replace to sanitise HTML strings. — evolutionxbox, Jan 16 '22 at 01:11
Does this answer your question? [Simple HTML sanitizer in Javascript](https://stackoverflow.com/questions/1637275/simple-html-sanitizer-in-javascript) — evolutionxbox, Jan 16 '22 at 01:11
Don't replace "strings"? Turn it into a normal DOM tree (using [documentFragment](https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment) if you can't just create a temporary div or the like), and then just use the textContent you need. Then reserialize that (if you actually need that) using `innerHTML` or `outerHTML`. — Mike 'Pomax' Kamermans, Jan 16 '22 at 01:17
First, I think it would be better to use the [`.replaceAll()`](https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Global_Objects/String/replaceAll) method — PaulCrp, Jan 16 '22 at 01:21
Your code would work with two changes. **1)** You have a typo at the third `replace` => `newPost = post.replace('<strong>','');`. -- `post` should be `newPost` like the others. **2)** Using `replaceAll` instead of `replace` would ensure the replacement of all occurrences. — Louys Patrice Bessette, Jan 16 '22 at 01:50
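The two fixes suggested in the comments above can be sketched roughly as follows. This is only an illustration of those suggestions, not a general sanitizer: it assumes the tags appear exactly as listed (no attributes), drops the `<u>` to `<span>` swap on the assumption that it was only there for troubleshooting, and adds a regex step (not in the original code) to collapse runs of `<p><br></p>`.

```javascript
// Sketch: the question's function with the typo fixed and replaceAll applied.
function stripElements(post) {
  let newPost = post;
  // Remove every occurrence of each inline formatting tag (assumed list).
  for (const tag of ['<u>', '</u>', '<strong>', '</strong>', '<em>', '</em>']) {
    newPost = newPost.replaceAll(tag, '');
  }
  // Assumption: a run of empty Quill paragraphs should become a single one.
  newPost = newPost.replace(/(?:<p><br><\/p>)+/g, '<p><br></p>');
  return newPost;
}

const sample = '<p><strong><em><u>a</u></em></strong></p><p><br></p><p><br></p><p><strong>b</strong></p>';
console.log(stripElements(sample));
```

Note that `replaceAll` needs a reasonably recent runtime; on older targets a global regex with `replace` does the same job.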

skara9 · Answer 1 · 2022-01-16T01:58:30.130

You can use the DOMParser API to parse and manipulate the HTML code:

post = '<p><strong><em><u>"Soft fingers began to tap the sill of the car window, and the hard fingers tightened on the restless drawing sticks. In the doorways of the sun-beaten tenant houses, women sighed and then shifted feet so that the one that had been down was now on top, and the toes working. Dogs came sniffing near the owner cars and wetted on all four tires one after another. And chickens lay in the sunny dust and fluffed their feathers </u></em></strong></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><strong><em>to get the cleansing dust down to the skin. In the little sties the pigs grunted inquiringly over the muddy remnants of the slops.""Soft fingers began to tap the sill of the car window, and the hard fingers tightened on the restless drawing sticks. In the doorways of the sun-beaten tenant houses, women sighed and then shifted feet so that the one that had been down was now on top, and the toes working. Dogs came sniffing near the owner cars and wetted on all four tires one after another. And chickens lay in the sunny dust and fluffed their feathers to get the cleansing dust down to the skin. In the little sties the pigs grunted inquiringly over the muddy remnants of the slops."</em></strong></p>'

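function stripElements(post) {
  // Parse the string into a separate document so the DOM can be edited safely
  const doc = new DOMParser().parseFromString(post, 'text/html');
  // Replace every element that is not a <p> with its plain text content
  doc.querySelectorAll('body :not(p)').forEach(el => el.replaceWith(el.textContent));
  return doc.body.innerHTML;
}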

console.log(stripElements(post))
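The function above strips the tags but does not consolidate the repeated blank paragraphs: each `<p><br></p>` should come out as an empty `<p></p>`. One possible follow-up step, sketched here with a hypothetical helper name (`collapseEmptyParagraphs` is not part of the answer), is to collapse those runs in the serialized string:

```javascript
// Hypothetical helper: collapse runs of two or more empty paragraphs
// (either <p></p> or <p><br></p>) into a single <p></p>.
function collapseEmptyParagraphs(html) {
  return html.replace(/(?:<p>(?:<br>)?<\/p>){2,}/g, '<p></p>');
}

console.log(collapseEmptyParagraphs('<p>a</p><p></p><p></p><p></p><p>b</p>'));
```

It would be applied to the return value, e.g. `collapseEmptyParagraphs(stripElements(post))`.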

@LouysPatriceBessette I think OP was trying to remove the `u` entirely, but it didn't work, so they tried troubleshooting with `span` — skara9, Jan 16 '22 at 01:54

score 2 · Accepted Answer · answered Jan 16 '22 at 01:28

Rule #1: Don't manipulate HTML with regexes. Use a DOM parser instead.

Rule #2: You probably don't want to fuss with the overhead of a DOM parser, just want to get the job done, and are likely to ignore Rule #1.

Therefore, if you wish, something like this might do the trick:

return post.replace(/<\/?[a-z]+>/gi, m => m.toLowerCase() === '<br>' ? '<p></p>' : '');

I'm not exactly sure this is how you wanted to handle the line breaks, but given this as a start you should be able to tweak it as you need.
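As one possible tweak along those lines, here is a sketch that keeps the `<p>` and `</p>` tags, drops everything else (including `<br>`), and then merges consecutive empty paragraphs. The function name `stripExceptParagraphs` is made up for illustration, and like the one-liner it only matches simple tags without attributes, so something like `<span class="x">` would pass through untouched.

```javascript
// Hypothetical variant of the regex one-liner from the answer.
function stripExceptParagraphs(post) {
  return post
    // Keep <p> and </p>; remove every other simple tag, <br> included.
    .replace(/<\/?([a-z][a-z0-9]*)>/gi, (m, name) =>
      name.toLowerCase() === 'p' ? m : '')
    // Merge each run of now-empty paragraphs into a single empty paragraph.
    .replace(/(?:<p><\/p>)+/g, '<p></p>');
}

const input = '<p><strong><em><u>a </u></em></strong></p><p><br></p><p><br></p><p><strong><em>b</em></strong></p>';
console.log(stripExceptParagraphs(input));
```

Whether an empty `<p></p>` or a `<p><br></p>` is the right stand-in for a blank line depends on how the result is rendered, so the second replacement is the part most likely to need adjusting.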
