
I have an HTML text file and I am trying to remove any HTML tags inside tables, i.e. remove any tags between the <TABLE> and </TABLE> tags.

What's really weird is that the regex I use, (?<=<table((?!</table).)*)<(?!/table)[^>]+>, works perfectly in PowerGREP and EditPad Pro, but when applied in VB.NET (or Expresso) to the very same text, it does not work!

I just use a simple Regex.Replace call: newString = Regex.Replace(oldString, "(?<=<table((?!</table).)*)<(?!/table)[^>]+>", String.Empty, RegexOptions.IgnoreCase)

I'm totally confused. Can anyone see why this happens and what I need to change for it to work in .NET? Thanks!

Below is the example text:

================
texttexetext

<TABLE>

  <TAG1>

    <TAG2>tabletext<TAG3>

    <TAG4>

</TABLE>

texttexttext
===============

The final output in PowerGREP is:

================
texttexetext

<TABLE>


 tabletext


</TABLE>

texttexttext
===============
johnv
  • It's hard to guess what's wrong without seeing the corresponding VB code. As an aside, for what it's worth, in general, to reliably extract information out of HTML, it's better to use an HTML parser (like HTML Agility Pack), since the grammar of HTML isn't regular. Regular expressions are often used by the tokenizer in a parsing solution, but aren't the whole story. – JasonTrue Dec 22 '10 at 18:47
  • [Haven't you already asked about this issue?](http://stackoverflow.com/questions/4483578/regex-to-parse-html-tables) You shouldn't use regex already. By the way, check if [the options](http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx) are the same (such as ignoreCase). – Camilo Martin Dec 22 '10 at 18:53
  • I'm pretty sure only Jon Skeet can parse HTML with Regex. Oh wait no he can't http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Conrad Frix Dec 22 '10 at 19:02

1 Answer

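It works in EditPad Pro only if you enable Dot Matches Newline mode. I don't see you doing that in your VB code. In .NET the equivalent setting is RegexOptions.Singleline; without it, the dot in the lookbehind cannot cross the line breaks between <TABLE> and the tags that follow it.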
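
As a minimal sketch of that change, the call from the question could be combined with RegexOptions.Singleline (the .NET name for "dot matches newline"). The module name and the inline sample string are assumptions made for illustration; the sample is rebuilt from the example text in the question, and the pattern is unchanged.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTableTags ' hypothetical module name, for illustration only
    Sub Main()
        ' Sample input rebuilt from the question (assumed to use LF line breaks).
        Dim oldString As String =
            "texttexetext" & vbLf &
            "<TABLE>" & vbLf &
            "  <TAG1>" & vbLf &
            "    <TAG2>tabletext<TAG3>" & vbLf &
            "    <TAG4>" & vbLf &
            "</TABLE>" & vbLf &
            "texttexttext"

        ' Same pattern as in the question.
        Dim pattern As String = "(?<=<table((?!</table).)*)<(?!/table)[^>]+>"

        ' Singleline lets "." match line breaks, so the lookbehind can reach
        ' back across lines to the opening <TABLE> tag.
        Dim newString As String = Regex.Replace(
            oldString, pattern, String.Empty,
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        Console.WriteLine(newString)
    End Sub
End Module
```

Placing the inline modifier (?s) at the start of the pattern should have the same effect if you would rather keep the option inside the regex itself.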

Alan Moore