3

I need to clean up the tags in a large number of HTML pages. I recently discovered the regex option in Notepad++.

But even after hours of Googling I can't seem to get it right. What do I need?

Example:

<p class=MsoNormal style='margin-left:19.85pt;text-indent:-19.85pt'><spanlang=NL style='font-size:11.0pt;font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang=NL style='font-size:9.0pt;font-family:"Arial","sans-serif"'>zware uitvoering met doorzichtige vulruimte;</span></p>

I need to remove everything related to styling, classes and ids, so that only the clean tags remain, without anything else.

Anyone able to help me on this one?

Kind regards

EDIT: You can check an entire file on Pastebin: http://pastebin.com/0tNwGUWP

Maarten

5 Answers

5

I think this pattern will match all the attributes in "p" and "span" tags, so replacing it with nothing erases them:

((?<=<p)|(?<=<span))[^>]*(?=>)

=> how it works:

  • ( (?<=<p) | (?<=<span) ): a lookbehind group that makes sure the string we are looking for comes after <p OR <span

  • [^>]* : matches any run of characters that are not a > character

  • (?=>) : a lookahead that makes sure the string we are looking for comes before a > character

PS: Tested in Notepad++
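To see the pattern in action outside Notepad++, here is a minimal sketch using Python's re module. The sample string is a shortened version of the line from the question, and the name ATTRS is only illustrative.

```python
import re

# Same pattern as above: match whatever sits between "<p" or "<span"
# and the closing ">" of the tag.
ATTRS = re.compile(r"((?<=<p)|(?<=<span))[^>]*(?=>)")

sample = ("<p class=MsoNormal style='margin-left:19.85pt'>"
          "<span lang=NL>zware uitvoering</span></p>")

# Replacing each match with an empty string leaves the bare tags.
cleaned = ATTRS.sub("", sample)
print(cleaned)

# Caveat: the lookbehind only checks for "<p", so a longer tag name
# that starts with "p" gets truncated as well.
print(ATTRS.sub("", "<pre class=x>"))
```

One thing to watch for, at least in Python's engine: because the lookbehind only checks for the characters <p, a tag such as <pre class=x> is reduced to <p>. If the pages contain other tags starting with "p", the pattern would need a stricter check after the tag name.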

0

If the sample you provided is representative of what you need to process, then the following quick-and-dirty solution will work:

Find what: [a-z]+='[^']*'
Replace with:

Find what: [a-z]+=[a-zA-Z]*
Replace with:

You must run the first one first to pick up the style='...' attributes; otherwise the second pattern would strip just the style= prefix and leave the quoted value behind. Then run the second one to pick up the unquoted class=... and lang=... attributes.
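The two passes can be sketched in Python's re module as follows. The sample string is shortened from the question, and the final whitespace clean-up step is my own addition (an assumption, not part of the two patterns above), since the two replacements leave stray spaces inside the tags.

```python
import re

sample = ("<p class=MsoNormal style='margin-left:19.85pt'>"
          "<span lang=NL>tekst</span></p>")

QUOTED = r"[a-z]+='[^']*'"      # pass 1: attributes with single-quoted values
UNQUOTED = r"[a-z]+=[a-zA-Z]*"  # pass 2: attributes with unquoted alphabetic values

step1 = re.sub(QUOTED, "", sample)
step2 = re.sub(UNQUOTED, "", step1)

# Extra step (assumption): drop the spaces left behind before ">".
cleaned = re.sub(r"\s+>", ">", step2)
print(cleaned)

# Running pass 2 on its own first shows why the order matters:
# "style=" is removed but its quoted value stays in the tag.
wrong_order = re.sub(UNQUOTED, "", sample)
print(wrong_order)
```

Note that both patterns also match text outside tags, so anything in the page body that looks like name=value would be removed too.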

There's good reason why other posters say not to attempt to parse HTML this way. You'll end up in all sorts of trouble, since regex in general cannot handle all the wonderful weirdness of HTML.

Stephen Quan
0

My advice is as follows.

As I see in your sample text, you have only "p" and "span" tags that need to be handled, and you apparently want to remove all the styles inside them. In this case, you could consider removing everything inside those tags, leaving them as plain <p> or <span>.

I don't know about Notepad++, but a simple C# program can do this job quickly.
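As a rough idea of what such a program might look like, here is a sketch in Python rather than C#, using only the standard library's html.parser. The names AttributeStripper and strip_attributes are hypothetical, and the sketch re-emits every tag without attributes, not just p and span.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit the document with every start tag reduced to its bare name."""

    def __init__(self):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")  # attrs are deliberately dropped

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_attributes(html):
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = ("<p class=MsoNormal style='margin-left:19.85pt'>"
          "<span lang=NL>zware&nbsp;uitvoering</span></p>")
print(strip_attributes(sample))
```

Because a parser tokenizes the tags itself, this approach does not depend on how the attribute values happen to be quoted.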

Johnny
0

Assuming <spanlang=NL is a typo (it should be <span lang=NL), I'd do:

Find what: (<\w+)[^>]*>
Replace with: $1>
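The same find/replace can be tried in Python's re module, where the replacement $1 is written \1. This is a small sketch on a shortened sample; the second call shows why the typo matters.

```python
import re

# Capture "<" plus the tag name, skip everything up to ">", keep only the capture.
TAG = re.compile(r"(<\w+)[^>]*>")

sample = ("<p class=MsoNormal>"
          "<span lang=NL style='font-size:9.0pt'>tekst</span></p>")
cleaned = TAG.sub(r"\1>", sample)
print(cleaned)

# With the typo left in, \w+ swallows "spanlang" as the tag name.
typo = TAG.sub(r"\1>", "<spanlang=NL style='font-size:11.0pt'>")
print(typo)
```

Closing tags such as </span> are left untouched because \w+ cannot match the slash.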

Toto
0

If you don't mind doing a little bit of programming: HtmlAgilityPack can easily remove scripts, styles, or whatever else from your XML/HTML.

Example:

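using System.Linq; // needed for Where/ToList

// `html` is assumed to hold the page source as a string
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Remove every <script> and <style> element from the document
doc.DocumentNode.Descendants()
                .Where(n => n.Name == "script" || n.Name == "style")
                .ToList() // snapshot first, so nodes can be removed while iterating
                .ForEach(n => n.Remove());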
woutervs