5

I am trying to replace the <br/> tags that appear between <pre> and </pre> tags with newlines. My string looks like this:

string str = "Test<br/><pre><br/>Test<br/></pre><br/>Test<br/>---<br/>Test<br/><pre><br/>Test<br/></pre><br/>Test";

string temp = "`##`";
while (Regex.IsMatch(str, @"\<pre\>(.*?)\<br\>(.*?)\</pre\>", RegexOptions.IgnoreCase))
{
    // Swap one <br> per <pre>...</pre> match for the placeholder, then loop again.
    str = System.Text.RegularExpressions.Regex.Replace(str, @"\<pre\>(.*?)\<br\>(.*?)\</pre\>", "<pre>$1" + temp + "$2</pre>", RegexOptions.IgnoreCase);
}
str = str.Replace(temp, System.Environment.NewLine);

But this replaces every <br> tag between the first <pre> and the last </pre> in the whole text, so my final outcome is:

str = "Test<br/><pre>\r\nTest\r\n</pre>\r\nTest\r\n---\r\nTest\r\n<pre>\r\nTest\r\n</pre><br/>Test"

I expected the outcome to be:

str = "Test<br/><pre>\r\nTest\r\n</pre><br/>Test<br/>---<br/>Test<br/><pre>\r\nTest\r\n</pre><br/>Test"
kennytm
Ashish
  • 1
    Is the format of the string always the same, that is, is it regular? Or are you trying to get this out of whole HTML pages that might be in completely different structures? – Oded Aug 13 '10 at 06:53
  • 4
    *sigh* http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Robin Day Aug 13 '10 at 06:55
  • 3
    There's a lot of wisdom that regex and html are **not** good friends. It might work for some *limited* scenarios, but personally I'd be using a parser/DOM/whatever here. – Marc Gravell Aug 13 '10 at 06:55
  • ok, so if I agree that parsing html with regex is not a good option. so then is it that the regex will only parse tags between first and last `<pre>` tags? – Ashish Aug 13 '10 at 07:09
  • how can "\<br\>" match "<br/>"? – Kikaimaru Aug 13 '10 at 07:24

4 Answers

3

If you are parsing whole HTML pages, RegEx is not a good choice - see here for a good demonstration of why.

Use an HTML parser such as the HTML Agility Pack for this kind of work. It also works with fragments like the one you posted.

Community
Oded
2

Don't use regex to do it.

"Be lazy, use CPAN and use HTML::Sanitizer." -Jeff Atwood, Parsing Html The Cthulhu Way

user375049
0
        string input = "Test<br/><pre><br/>Test<br/></pre><br/>Test<br/>---<br/>Test<br/><pre><br/>Test<br/></pre><br/>Test";
        string pattern = @"<pre>(.*)<br/>(([^<][^/][^p][^r][^e][^>])*)</pre>";
        while (Regex.IsMatch(input, pattern))
        {
            input = Regex.Replace(input, pattern, "<pre>$1\r\n$2</pre>");
        }

This will probably work, but you should use the HTML Agility Pack. Note that this pattern only matches <br/>; it will not match <br>, <br /> and other variants.
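As a rough alternative to the character-class trick above, the "no tag boundary in between" condition can be written with lookaheads. This is only a sketch under assumptions: the pattern and variable names are made up for illustration, the code uses top-level statements (C# 9 or later), and it treats a <br> as being inside <pre> whenever a </pre> follows it before any new <pre> opens. A stray </pre> without its opening tag would fool it, which is another reason to prefer a parser for real pages.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pattern (an assumption, not the answerer's code):
//   <br\s*/?>           matches <br>, <br/> and <br />
//   (?= ... </pre>)     requires a closing </pre> somewhere after the tag
//   (?:(?!<pre\b).)*?   but refuses to step over the start of another <pre>
string brInsidePre = @"<br\s*/?>(?=(?:(?!<pre\b).)*?</pre>)";

string input = "Test<br/><pre><br/>Test<br/></pre><br/>Test<br/>---<br/>Test<br/><pre><br/>Test<br/></pre><br/>Test";

// Singleline lets '.' cross line breaks; IgnoreCase also accepts <BR> and <PRE>.
string output = Regex.Replace(input, brInsidePre, "\r\n",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

Console.WriteLine(output);
```

Because the whole condition lives in one pattern, a single Regex.Replace call is enough and no while loop is needed.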

Kikaimaru
0

OK, so I discovered the issue with my code. The problem was that Regex.IsMatch was considering only the first occurrence of <pre> and the last occurrence of </pre>, whereas I wanted each <pre>...</pre> pair to be handled on its own. So I modified my code as follows:

foreach (Match regExp in Regex.Matches(str, @"\<pre\>(.*?)\<br\>(.*?)\</pre\>", RegexOptions.IgnoreCase)) 
{
    matchFound = true;
    str = str.Replace(regExp.Value, regExp.Value.Replace("<br>", temp));
}

and it worked well. Anyway, thanks to all for your replies.
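The same per-block idea can also be done in one pass with a MatchEvaluator. The following is a minimal sketch, not the code from this post: ReplaceBrInsidePre is a hypothetical helper name, and the code assumes C# 9 or later so that it can be written as top-level statements. The outer pattern matches one <pre>...</pre> block at a time, and the lambda rewrites the <br> variants inside that block only, so nothing outside a block is touched.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name chosen for illustration): replaces <br>, <br/> and
// <br /> with a newline, but only inside <pre>...</pre> blocks.
string ReplaceBrInsidePre(string html)
{
    return Regex.Replace(
        html,
        // One block per match: the lazy .*? stops at the first </pre> it meets.
        @"<pre\b[^>]*>.*?</pre>",
        // Called once per block; only the text of that block is rewritten.
        block => Regex.Replace(block.Value, @"<br\s*/?>", Environment.NewLine,
                               RegexOptions.IgnoreCase),
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

string str = "Test<br/><pre><br/>Test<br/></pre><br/>Test<br/>---<br/>Test<br/><pre><br/>Test<br/></pre><br/>Test";
string cleaned = ReplaceBrInsidePre(str);
Console.WriteLine(cleaned);
```

For the sample string from the question, this leaves every <br/> outside the two <pre> blocks in place and puts Environment.NewLine where each inner <br/> was. It still assumes well-formed, non-nested <pre> blocks; for arbitrary pages a real HTML parser remains the safer choice.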

Ashish