Notepad++ Regex to remove styling

Question

I need to remove some tags from a whole lot of HTML pages. I recently discovered the regex option in Notepad++.

But even after hours of Googling I can't seem to get it right. What do I need?

Example:

<p class=MsoNormal style='margin-left:19.85pt;text-indent:-19.85pt'><spanlang=NL style='font-size:11.0pt;font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang=NL style='font-size:9.0pt;font-family:"Arial","sans-serif"'>zware uitvoering met doorzichtige vulruimte;</span></p>

I need to remove everything related to styling, classes and ids, so that only the clean tags are left.

Anyone able to help me on this one?

Kind regards

EDIT: An entire file is available on pastebin: http://pastebin.com/0tNwGUWP

Don't use regex to parse HTML: http://stackoverflow.com/a/1732454/2812842 — scrowler, Feb 21 '14 at 03:35
`Correction:` Don't use regex to parse HTML when tags nested inside themselves are involved — Vasili Syrakis, Feb 21 '14 at 03:43
I know there's a lot of wrong/old tags being used in the documents but that is not the case. Just have to clean them out so I can use them. — Maarten, Feb 21 '14 at 03:49

Mourad El Aomari · Answer 1 · 2015-10-17T01:00:32.323

I think this pattern will erase all attributes (styles included) in "p" and "span" tags:

((?<=<p)|(?<=<span))[^>]*(?=>)

How it works:

( (?<=<p) | (?<=<span) ) : a lookbehind block that makes sure the match comes right after <p or <span
[^>]* : matches any run of characters that are not >
(?=>) : a lookahead block that makes sure the match is followed by a > character

PS: Tested on Notepad++
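To try the pattern outside Notepad++, here is a minimal Python sketch. It assumes Python's `re` module behaves like Notepad++ for this input, which is only checked here against a shortened sample; the function name `strip_p_span_attributes` and the sample string are made up for illustration.

```python
import re

# The pattern from this answer. Python's re accepts it because each
# lookbehind alternative is fixed-width on its own.
ATTRS_AFTER_P_OR_SPAN = re.compile(r"((?<=<p)|(?<=<span))[^>]*(?=>)")


def strip_p_span_attributes(html: str) -> str:
    """Delete everything between '<p' or '<span' and the next '>'."""
    return ATTRS_AFTER_P_OR_SPAN.sub("", html)


# Shortened version of the sample from the question (illustrative only).
sample = (
    "<p class=MsoNormal style='margin-left:19.85pt'>"
    "<span lang=NL style='font-size:9.0pt'>zware uitvoering</span></p>"
)
print(strip_p_span_attributes(sample))

# Caveat: '<p' is also the start of other tag names such as <pre>.
print(strip_p_span_attributes("<pre>code</pre>"))
```

The first call prints `<p><span>zware uitvoering</span></p>`. The second prints `<p>code</pre>`, because the lookbehind only checks for the two characters `<p`, so tags like `<pre>` or `<param>` get truncated as well. If the pages contain such tags, the pattern would need a guard for that.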

score 0 · Answer 2 · answered Feb 21 '14 at 03:55

If the sample you provided is representative of what you need to process, then the following quick and dirty solution will work:

Find what: [a-z]+='[^']*'
Replace with: (leave empty)

Find what: [a-z]+=[a-zA-Z]*
Replace with: (leave empty)

You must run the first one first to pick up the style='...' attributes, and then run the second one to pick up the unquoted class=... and lang=... attributes.

There's good reason why other posters are saying not to parse HTML this way. You'll end up in all sorts of trouble, since regex in general cannot handle all the wonderful weirdness of HTML.
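The two passes can be replayed in Python to see what they leave behind. This is only a sketch on a shortened sample: the function names are invented, and `tidy_tag_whitespace` is an extra step that is not part of the answer.

```python
import re

QUOTED_ATTR = re.compile(r"[a-z]+='[^']*'")      # first "Find what"
UNQUOTED_ATTR = re.compile(r"[a-z]+=[a-zA-Z]*")  # second "Find what"


def two_pass_strip(html: str) -> str:
    """Run both replacements in the order the answer prescribes."""
    html = QUOTED_ATTR.sub("", html)
    return UNQUOTED_ATTR.sub("", html)


def tidy_tag_whitespace(html: str) -> str:
    """Optional extra step (not in the answer): drop spaces left before '>'."""
    return re.sub(r"\s+>", ">", html)


# Shortened version of the sample from the question (illustrative only).
sample = (
    "<p class=MsoNormal style='margin-left:19.85pt'>"
    "<span lang=NL style='font-size:9.0pt'>zware uitvoering</span></p>"
)
stripped = two_pass_strip(sample)
print(repr(stripped))
print(tidy_tag_whitespace(stripped))

# Running the second pattern first shows why the order matters:
print(UNQUOTED_ATTR.sub("", "<p style='color:red'>"))
```

After the two passes the tags still contain the spaces that separated the attributes (`<p  >`), which browsers tolerate; the tidy step turns that into `<p>`. The last line prints `<p 'color:red'>`: with the order reversed, the second pattern eats only `style=` and strands the quoted value. Note also that both patterns match anywhere in the file, so body text such as `a=b` would be removed too.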

score 0 · Answer 3 · answered Feb 21 '14 at 04:03

My advice is as follows.

As far as I can see, your sample text has only "p" and "span" tags that need to be handled, and you apparently want to remove all the styles inside them. In that case, you could consider removing everything inside those tags, leaving just a plain <p> or <span>.

I don't know about Notepad++ but a simple C# program can do this job quickly.
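No program is given here, so the following is only a guess at what such a tool might look like, written in Python rather than C#. The names `simplify_p_and_span` and `clean_folder`, the `*.html` glob and the UTF-8 default are all assumptions; the files in the question look like Word exports and may use another encoding.

```python
import re
from pathlib import Path

# Matches an opening <p ...> or <span ...> tag. The \b keeps tags such as
# <pre> or <param> out of the match.
P_OR_SPAN_OPEN = re.compile(r"<(p|span)\b[^>]*>", re.IGNORECASE)


def simplify_p_and_span(html: str) -> str:
    """Reduce every opening p/span tag to the bare tag name."""
    return P_OR_SPAN_OPEN.sub(r"<\1>", html)


def clean_folder(source: Path, target: Path, encoding: str = "utf-8") -> int:
    """Write cleaned copies of source/*.html into target; return the file count."""
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for page in sorted(source.glob("*.html")):
        text = page.read_text(encoding=encoding)
        cleaned = simplify_p_and_span(text)
        (target / page.name).write_text(cleaned, encoding=encoding)
        count += 1
    return count
```

A call such as `clean_folder(Path("pages"), Path("pages_clean"))` (hypothetical folder names) leaves the originals untouched and writes the cleaned copies next to them. One consequence of the `\b`: a malformed tag like `<spanlang=NL ...>` from the sample is not recognised as a span and stays as it is.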

score 0 · Answer 4 · answered Feb 21 '14 at 08:31

Assuming <spanlang=NL is a typo (it should be <span lang=NL), I'd do:

Find what: (<\w+)[^>]*>
Replace with: $1>
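For anyone who wants to check the effect before running it over many files, here is a small Python sketch of the same find/replace. The function name and sample strings are illustrative; the only real difference from Notepad++ is that Python writes the backreference as `\1` instead of `$1`.

```python
import re

ANY_OPENING_TAG = re.compile(r"(<\w+)[^>]*>")


def strip_all_attributes(html: str) -> str:
    """Keep '<name' and the closing '>' of every opening tag, drop the rest."""
    # Notepad++ writes the backreference as $1; Python's re.sub uses \1.
    return ANY_OPENING_TAG.sub(r"\1>", html)


# Shortened version of the sample from the question (illustrative only).
sample = (
    "<p class=MsoNormal style='margin-left:19.85pt'>"
    "<span lang=NL style='font-size:9.0pt'>zware uitvoering</span></p>"
)
print(strip_all_attributes(sample))

# The same rule hits every tag, including ones whose attributes carry content:
print(strip_all_attributes('<a href="page.html">link</a><br />'))

# Without the space, \w+ reads "spanlang" as the tag name:
print(strip_all_attributes("<spanlang=NL style='font-size:11.0pt'>x</span>"))
```

The three calls print `<p><span>zware uitvoering</span></p>`, `<a>link</a><br>` and `<spanlang>x</span>`. So the pattern does what the question asks for every tag, but it also drops `href`, `src` and similar attributes, and the `<spanlang` typo has to be fixed beforehand.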

Toto

score 0 · Answer 5 · answered Feb 21 '14 at 09:10

If you don't mind doing a little bit of programming: HtmlAgilityPack can easily remove scripts, styles or whatever else from your XML/HTML.

Example:

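using System.Linq;  // for Where() and ToList()

// html is assumed to hold the page source as a string
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Remove every <script> and <style> element from the document
doc.DocumentNode.Descendants()
                .Where(n => n.Name == "script" || n.Name == "style")
                .ToList()
                .ForEach(n => n.Remove());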
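The snippet above removes whole script and style elements rather than the style, class and lang attributes from the question. As a rough illustration of the parser-based route for attributes, here is a sketch using Python's standard `html.parser` module instead of HtmlAgilityPack. The class and function names are invented, and it does not try to preserve everything (processing instructions are dropped, tag names come out lowercased).

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit the document with every attribute dropped from every tag."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &nbsp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_attributes(html: str) -> str:
    """Parse html and return it with all attributes removed."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    "<p class=MsoNormal style='margin-left:19.85pt'>"
    "<span lang=NL style='font-size:9.0pt'>&nbsp;zware uitvoering</span></p>"
)
print(strip_attributes(sample))
```

Because the parser understands quoting, an attribute value that contains `>` or mixed quotes does not confuse it the way it can confuse a regex. To keep some attributes (for example `href`), `handle_starttag` could filter `attrs` instead of discarding it.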
