Removing HTML attributes from an HTML string value using regex

Question

I need to remove HTML attributes from an HTML string. I have some formatted text input fields that allow users to copy and paste text while keeping its basic HTML. The issue is that some text copied from a Word document comes with attributes that need to be removed. The regex I'm currently using works in a regex tester, but in my code none of the attributes are being removed.

Code to remove attributes:

var stringhtml = '<div class="Paragraph  BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'

var regex = /[a-zA-Z]*=".*?"/;

var replacedstring = stringhtml.replace(regex, '');

document.write(replacedstring);

Any help is appreciated!

You forgot the [`g`](https://stackoverflow.com/questions/12993629/what-is-the-meaning-of-the-g-flag-in-regular-expressions) flag: `/[a-zA-Z]*=".*?"/g` — Hao Wu, Oct 01 '21 at 00:55
You can also add the `i` flag and replace the `[a-zA-Z]` with `[a-z]`. Also beware, both `'` and `"` are valid for attribute value strings. You could try this regex `\s*[a-zA-Z]*=["'].*?["']\s*`, as it would also replace the whitespace before and after an attribute if it exists. — Polymer, Oct 01 '21 at 00:59
I'm not sure why you need `.*?`. That seems like invalid regex to me. How is it different than `.*`? — Garr Godfrey, Oct 01 '21 at 01:07

dave · Accepted Answer · 2021-10-01T05:15:46.623

There's quite a lot of literature out there on why parsing HTML with regex can be risky; this famous Stack Overflow question is a good example.

As @Polymer has pointed out, your current regex will miss attributes with single quotes, but there are other cases too: data attributes, e.g. `data-id="233"`, will not be removed cleanly, and neither will valueless attributes such as `disabled`. There could be more!

With this approach you can end up playing catch-up, changing your regex every time you encounter a new combination in your HTML.
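
As a rough illustration of the gaps described above, the sketch below runs the question's pattern (with the `g` flag from the comments added) against a few made-up inputs. The `samples` object and its labels are hypothetical, chosen only to show each case.

```javascript
// The question's pattern, with the `g` flag added so every match is replaced.
const attributePattern = /[a-zA-Z]*=".*?"/g;

// Hypothetical inputs, one per case mentioned in the answer.
const samples = {
  doubleQuoted: '<div class="a" style="margin:0">x</div>',
  singleQuoted: "<div class='a'>x</div>",
  dataAttribute: '<div data-id="233">x</div>',
  booleanAttribute: '<input disabled>',
};

// Apply the same replacement to each sample and keep the results for inspection.
const results = {};
for (const [label, html] of Object.entries(samples)) {
  results[label] = html.replace(attributePattern, '');
  console.log(label, '->', results[label]);
}
```

Only the double-quoted case comes out clean (apart from leftover spaces). The single-quoted and `disabled` inputs are returned unchanged, and the `data-id` input is left as `<div data->x</div>`, because `[a-zA-Z]*` cannot match across the hyphen.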

A safer approach might be to use DOMParser to parse your string as HTML, and extract the contents from it that way:

let stringhtml = '<div class="Paragraph  BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'

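// Parse the string into a separate document; nothing is added to the page.
let parser = new DOMParser();
let parsedResult = parser.parseFromString(stringhtml, 'text/html');

// Create a fresh element with the same tag name, so it carries no attributes.
// firstElementChild skips any leading whitespace text node in the input.
let element = document.createElement(parsedResult.body.firstElementChild.tagName);

// Copy over only the text; any nested markup is flattened to plain text.
element.innerText = parsedResult.documentElement.textContent;

console.log(element);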
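
The snippet above keeps only the text. If nested tags such as `<b>` need to survive, one possible variation is to walk the parsed tree and remove every attribute in place. This is a minimal sketch, not part of the original answer: `stripAttributes` and `stripAttributesFromHtml` are hypothetical helper names, and it assumes a browser environment where `DOMParser` is available.

```javascript
// Hypothetical helper: remove every attribute from an element and its descendants.
function stripAttributes(element) {
  // Copy the names first, because the attribute collection is live
  // and shrinks while attributes are being removed.
  const names = Array.from(element.attributes, (attr) => attr.name);
  for (const name of names) {
    element.removeAttribute(name);
  }
  // Recurse into child elements (text nodes have no attributes).
  for (const child of Array.from(element.children)) {
    stripAttributes(child);
  }
}

// Hypothetical helper: parse an HTML string, strip attributes, serialize it back.
function stripAttributesFromHtml(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  for (const child of Array.from(doc.body.children)) {
    stripAttributes(child);
  }
  return doc.body.innerHTML;
}

// Only run the demo where DOMParser exists (i.e. in a browser).
if (typeof DOMParser !== 'undefined') {
  const input = '<div class="a" style="margin:0">some <b id="x">text</b></div>';
  console.log(stripAttributesFromHtml(input));
}
```

In a browser the demo should log `<div>some <b>text</b></div>`. Because the parser handles quoting, data attributes and valueless attributes itself, none of the regex edge cases need special treatment here.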
