I have the following string:

<pre>one</pre><p><b>two</b></p>\n<pre>DO NOT    MATCH</pre><pre>BALLS</pre>

I want to match <pre></pre> tags and replace them with <p></p>.

I do not want to match a block whose content contains a run of multiple spaces, such as:

<pre>DO NOT!    !MATCH</pre> 

Here is my regex:

<pre>((?:[^\n]+?))</pre>

It matches all tokens inside <pre></pre> tags that are on a single line.

Actual result:

<p>one</p>
<p><b>two</b></p>\n<p>DO    NOT    MATCH</p>
<p>BALLS</p>

Expected result:

<p>one</p>
<p><b>two</b></p>\n
<p>BALLS</p>

C# flavor demo.

Amessihel
Bob
  • For small things it can be acceptable, but if you're looking for a generic tool for any HTML, I would avoid regex aside from simple selections. You're mathematically disadvantaged by only using regex here. Instead, you might consider an HTML/DOM parser (and swapping those tags yourself). – Rogue Apr 11 '23 at 15:42
  • Regular expressions are the wrong tool for this job. Use an HTML parser. See https://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c – user229044 Apr 11 '23 at 15:47
  • FYI `[^\n]+?` is the same as `.+?`. Unless you enable the `DOTALL` flag, `.` doesn't match newlines. – Barmar Apr 11 '23 at 15:49
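
The last comment can be checked with a small sketch. It assumes the .NET regex engine, where the flag usually called `DOTALL` elsewhere is `RegexOptions.Singleline`; the sample text and variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: the second <pre> block spans two lines.
string text = "<pre>a</pre>\n<pre>b\nc</pre>";

// Without Singleline, '.' and [^\n] both stop at a newline.
int withClass = Regex.Matches(text, @"<pre>([^\n]+?)</pre>").Count;
int withDot = Regex.Matches(text, @"<pre>(.+?)</pre>").Count;

// With Singleline, '.' also matches '\n', so the two-line block is found too.
int withSingleline = Regex.Matches(text, @"<pre>(.+?)</pre>", RegexOptions.Singleline).Count;

Console.WriteLine($"{withClass} {withDot} {withSingleline}"); // 1 1 2
```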

2 Answers

DISCLAIMER: consider this an exercise. If you're planning to do something like this in real-world development, please don't. Use an HTML parser instead.

Since you basically need two different changes (convert the good <pre> blocks to <p> and remove the bad ones), let's do it in two steps:

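using System;
using System.Text.RegularExpressions;

string input = "<pre>one</pre><p><b>two</b></p>\n<pre>DO    NOT    MATCH</pre><pre>BALLS</pre>";

// Single-line <pre> block with no nested <pre> tag and no character preceded by three whitespace characters.
Regex regex_replace = new Regex(@"<pre>((?:(?<!\s{3})(?!</?pre>)[^\n])+?)</pre>");
// Any remaining single-line <pre> block.
Regex regex_delete = new Regex(@"<pre>[^\n]*?</pre>");

// Convert the good blocks first, then delete whatever is still a <pre>.
string result = regex_delete.Replace(regex_replace.Replace(input, "<p>$1</p>\n"), "");
Console.WriteLine(result);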

Output:

<p>one</p>
<p><b>two</b></p>
<p>BALLS</p>

Here regex_replace is used to replace the good <pre> blocks with <p>. It matches <pre> blocks that contain no other <pre> tag and in which no character is preceded by three consecutive whitespace characters.

regex_delete then removes all the remaining <pre> blocks.
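
To see what each step contributes, the two replacements can be run separately. This is only a sketch built on the patterns above; the names `sample`, `keep`, `drop`, `afterStep1` and `afterStep2` are introduced here for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string sample = "<pre>one</pre><p><b>two</b></p>\n<pre>DO    NOT    MATCH</pre><pre>BALLS</pre>";

// Same patterns as regex_replace and regex_delete above.
var keep = new Regex(@"<pre>((?:(?<!\s{3})(?!</?pre>)[^\n])+?)</pre>");
var drop = new Regex(@"<pre>[^\n]*?</pre>");

// Step 1: only the good blocks are rewritten; the block with runs of spaces stays a <pre>.
string afterStep1 = keep.Replace(sample, "<p>$1</p>\n");

// Step 2: whatever is still wrapped in <pre> is deleted.
string afterStep2 = drop.Replace(afterStep1, "");

Console.WriteLine(afterStep1.Contains("<pre>DO    NOT    MATCH</pre>")); // True
Console.WriteLine(afterStep2.Contains("<pre>"));                          // False
```

The order matters: running the delete pattern first would remove every single-line <pre> block, including the good ones.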

markalex

If you're in full control of the HTML input, you can use this regex:

<pre>((?:[^<\s]\s?)*)</pre>

(?:[^<\s]\s?)* stands for "a non-whitespace character other than <, followed by at most one whitespace character, the whole thing repeated zero or more times".

This sequence is then captured in group $1 (Demo).
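
A minimal sketch of how this pattern might be applied in C#, assuming the sample string from the question; the names `html`, `singleSpaced` and `converted` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<pre>one</pre><p><b>two</b></p>\n<pre>DO    NOT    MATCH</pre><pre>BALLS</pre>";

// Content may not contain '<' or two whitespace characters in a row.
var singleSpaced = new Regex(@"<pre>((?:[^<\s]\s?)*)</pre>");

// Matching blocks become <p>...</p>; non-matching blocks are left untouched.
string converted = singleSpaced.Replace(html, "<p>$1</p>");
Console.WriteLine(converted);
```

Note that this only converts the matching blocks. The <pre> block containing runs of spaces stays in the output as it is, so a second pass would still be needed to remove it and get the expected result from the question.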


As others have said, don't use regex to parse general HTML content, or anything else that is not a regular language.

Amessihel