
I am trying to remove the tables from an HTML file. Specifically, for the document below, I'd like to remove anything within the tags <TABLE ...> and </TABLE>. The document contains multiple tables with text in between.

The expression I came up with, `<TABLE.*>\s*[\s|\S]*</TABLE>\s*`, also removes the text between the tables. In fact, it removes everything between the first <TABLE> and the last </TABLE> tags. I would like to keep the text in between and remove only the tables. Any suggestions are greatly appreciated. Thanks.

====================

<TABLE STYLE=xxx, Font=yyy, etc>

table texts that should be DELETED...

</TABLE>


other texts that should be KEPT...


<TABLE STYLE=xxx, Font=yyy, etc>

table texts that should be DELETED...

</TABLE>

 ==========================================
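For reference, the behaviour described above can be reproduced with a short sketch. Python's `re` module is used here purely as a stand-in (the thread itself targets .NET), and `SAMPLE` is a shortened, hypothetical version of the document above:

```python
import re

# Shortened stand-in for the sample document in the question.
SAMPLE = (
    "<TABLE STYLE=xxx>\n"
    "table texts that should be DELETED...\n"
    "</TABLE>\n"
    "other texts that should be KEPT...\n"
    "<TABLE STYLE=xxx>\n"
    "table texts that should be DELETED...\n"
    "</TABLE>\n"
)

# The expression from the question, unchanged.
GREEDY = r"<TABLE.*>\s*[\s|\S]*</TABLE>\s*"

# [\s|\S]* is greedy: it runs to the end of the input and backtracks only
# as far as the LAST </TABLE>, so a single match spans both tables.
match = re.search(GREEDY, SAMPLE)
greedy_result = re.sub(GREEDY, "", SAMPLE)
```

In this sketch the single match covers the whole sample, so the text between the tables is removed along with them.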
johnv
  • **Just. Don't.** Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Matt Ball Dec 19 '10 at 15:19
  • Regular expressions match _"regular"_ languages. HTML isn't regular. Don't try to parse it using regex. – Phrogz Dec 19 '10 at 16:22

2 Answers

The answer is to use an HTML or SGML parser; there are several available for .NET:

http://htmlagilitypack.codeplex.com/

SGML parser .NET recommendations

If you absolutely want to use regular expressions, familiarize yourself with balancing groups; otherwise nested tables will break the match. It's not easy, and it may be much slower than a regular SGML parser. Be warned though: seeing your expression, I assume that you are a regex newbie (hint: avoid greedy `.` matches at any cost), so this is probably not yet your cup of tea.
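To give a feel for the parser route, here is a minimal sketch using Python's standard-library `html.parser` as a stand-in. It is not the HTML Agility Pack API, and the names `TableStripper` and `strip_tables` are made up for illustration. It re-emits the input while tracking `<table>` nesting depth, so nested tables are handled. As a simplification, comments and the doctype are dropped, and end tags are re-emitted in lowercase.

```python
from html.parser import HTMLParser


class TableStripper(HTMLParser):
    """Re-emit the input, skipping everything inside <table>...</table>."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # current <table> nesting depth
        self.parts = []   # output pieces that lie outside every table

    def handle_starttag(self, tag, attrs):
        if tag == "table":          # tag names arrive lowercased
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: copy them through unchanged.
        if self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_tables(html):
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling `strip_tables(document_text)` returns the document without its tables. A full parser library would additionally repair malformed markup, which this sketch does not attempt.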

Lucero

Since I know you're not going to look at an HTML parser even if I tell you you really should, I'll just answer the question.

This matches only tables:

<table.*?>.*?</table>

It requires two options: dotall (called Singleline in .NET) and ignoreCase.
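As a quick check of the pattern above, here is a small sketch in Python (a stand-in; in .NET the corresponding flags are `RegexOptions.Singleline` and `RegexOptions.IgnoreCase`). The helper name `strip_tables` and the sample strings are hypothetical:

```python
import re

# Lazy quantifiers (.*?) stop at the FIRST closing tag instead of the last.
# DOTALL lets "." cross line breaks; IGNORECASE matches <TABLE> and <table>.
TABLE_RE = re.compile(r"<table.*?>.*?</table>", re.DOTALL | re.IGNORECASE)


def strip_tables(html):
    """Remove every non-nested <table>...</table> block from html."""
    return TABLE_RE.sub("", html)


sample = (
    "<TABLE STYLE=xxx>\nDELETED 1\n</TABLE>\n"
    "other texts that should be KEPT...\n"
    "<TABLE STYLE=xxx>\nDELETED 2\n</TABLE>\n"
)
cleaned = strip_tables(sample)

# Nested tables defeat the pattern: the match ends at the inner </table>.
nested = "<table><table>inner</table>tail</table>after"
nested_result = strip_tables(nested)
```

With flat tables, `cleaned` keeps the text between them. With the nested input, the pattern stops at the inner closing tag and leaves `tail</table>after` behind, which is the limitation raised in the comments.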

You can try it here: http://gskinner.com/RegExr/

Now, do consider using the HTML Agility Pack suggested by Lucero, OK?

Edit: maybe this was what you meant, sorry:

                             

Camilo Martin
  • Add a nested `TABLE` tag to your sample and it will start to rock! :-) – Lucero Dec 19 '10 at 16:07
  • @Lucero you're right, it breaks at the first sight of a nested table. But again, I guess markup can't be parsed by regex because it's not "regular". Right? In any case, your link does contain a solution to this for .NET (kudos!). – Camilo Martin Dec 19 '10 at 16:25
  • @Camilo, thanks for the kudos! It wasn't meant as criticism of your sample; I only wanted to illustrate why this can only be solved using regular expressions if you do have balancing groups support (which is not part of most common regex engines, but the .NET engine does support it). With those, you can actually have nested start-end matches, so that it can be done. – Lucero Dec 19 '10 at 16:30
  • Here's a balancing groups sample: `(?<=<table[^>]*>)((?<table><table[^>]*>)|(?<-table></table>)|.)*?(?(table)(?!))(?=</table>)` (also with the ignorecase and dotall/singleline options); replace those occurrences with an empty string and all the tables (no matter the nesting depth) will be correctly emptied when using the .NET regex engine. – Lucero Dec 19 '10 at 16:42
  • @Lucero This is actually very useful for a lot of situations (even if I'd not use it for HTML tables), since nested structures are a regex problem I had in the past and it solves it beautifully. Also it's a definitive answer for the OP: either use HTML Agility Pack (WatiN too) or use that regex. – Camilo Martin Dec 19 '10 at 17:01
  • @Camilo, it really is a great solution. But it's easy to get wrong, and as I said there are very few engines supporting that. – Lucero Dec 19 '10 at 17:04