Which regex to use to retrieve only text without repeated characters?

Question

I have the following string:

<pre>one</pre><p><b>two</b></p>\n<pre>DO    NOT    MATCH</pre><pre>BALLS</pre>

I want to match <pre></pre> tags and replace them with <p></p>.

I do not want to match the part that contains multiple consecutive spaces:

<pre>DO NOT!    !MATCH</pre>

Here is my regex:

<pre>((?:[^\n]+?))</pre>

It matches all tokens inside <pre></pre> tags that are on a single line.

Actual result:

<p>one</p>
<p><b>two</b></p>\n<p>DO    NOT    MATCH</p>
<p>BALLS</p>

Expected result:

<p>one</p>
<p><b>two</b></p>\n
<p>BALLS</p>

C# flavor demo.

For small things it can be acceptable, but if you're looking for a generic tool for any HTML, I would avoid regex aside from simple selections. You're mathematically disadvantaged by only using regex here. Instead, you might consider an HTML/DOM parser (and swapping those tags yourself). — Rogue, Apr 11 '23 at 15:42
Regular expressions are the wrong tool for this job. Use an HTML parser. See https://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c — user229044, Apr 11 '23 at 15:47
FYI `[^\n]+?` is the same as `.+?`. Unless you enable the `DOTALL` flag, `.` doesn't match newlines. — Barmar, Apr 11 '23 at 15:49

score 1 · Accepted Answer · answered Apr 11 '23 at 15:43

DISCLAIMER: consider this an exercise. If you're planning to do something like this in real-world development, please don't. Use an HTML parser instead.

Since you basically need two different changes (convert good <pre> to <p> and remove bad <pre>), let's do it in two steps:

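using System;
using System.Text.RegularExpressions;

string input = "<pre>one</pre><p><b>two</b></p>\n<pre>DO    NOT    MATCH</pre><pre>BALLS</pre>";

// Step 1: a single-line <pre> with no nested <pre> tag; the lookbehind rejects any character that directly follows three whitespace characters.
Regex regex_replace = new Regex(@"<pre>((?:(?<!\s{3})(?!</?pre>)[^\n])+?)</pre>");
// Step 2: any single-line <pre> that step 1 left untouched.
Regex regex_delete = new Regex(@"<pre>[^\n]*?</pre>");

// Convert the good blocks first, then delete whatever <pre> remains.
string result = regex_delete.Replace(regex_replace.Replace(input, "<p>$1</p>\n"), "");
Console.WriteLine(result);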

Output:

<p>one</p>
<p><b>two</b></p>
<p>BALLS</p>

Here regex_replace is used to replace good <pre> with <p>. It matches <pre> elements that contain neither another <pre> tag nor three consecutive whitespace characters.

regex_delete then removes all the remaining <pre> elements.
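
For comparison, the same two decisions (convert or delete) can probably be folded into one pass with a MatchEvaluator callback. This is only a sketch of an alternative, not the code from this answer: the class name SinglePass and the method Convert are made up for illustration, and unlike the code above it does not append "\n" after each <p>.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the answer above.
public static class SinglePass
{
    // One single-line <pre> element; the (?!</?pre>) guard stops at any nested or closing pre tag.
    private static readonly Regex Pre = new Regex(@"<pre>((?:(?!</?pre>)[^\n])*)</pre>");

    public static string Convert(string input)
    {
        // The callback decides per match: delete the block or swap its tags.
        return Pre.Replace(input, m =>
            Regex.IsMatch(m.Groups[1].Value, @"\s{3}")
                ? ""                                    // three whitespace characters in a row: drop it
                : "<p>" + m.Groups[1].Value + "</p>");  // otherwise keep the content inside <p>
    }
}
```

Calling SinglePass.Convert on the input string from the question should leave <p>one</p>, the untouched <p><b>two</b></p> with its newline, and <p>BALLS</p>.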

Amessihel · Answer 2 · 2023-04-11T16:00:57.563

If you're in full control of the HTML input, you can use this regex:

<pre>((?:[^<\s]\s?)*)</pre>

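(?:[^<\s]\s?)* stands for "a single non-whitespace character other than <, followed by at most one whitespace character, the whole thing repeated zero or more times". As a consequence, content with two or more consecutive whitespace characters, leading whitespace, or a nested tag is not matched.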

This sequence is then captured into the group $1 (Demo).
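
A minimal sketch of how this pattern might be applied in C#, assuming the replacement string "<p>$1</p>" (the answer does not spell one out). The class name SingleSpaceDemo and the method Convert are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from this answer.
public static class SingleSpaceDemo
{
    public static string Convert(string input)
    {
        // [^<\s]\s? = one non-whitespace character other than '<', then at most one whitespace.
        // $1 re-inserts the captured content between the new <p> tags.
        return Regex.Replace(input, @"<pre>((?:[^<\s]\s?)*)</pre>", "<p>$1</p>");
    }
}
```

Note that this only converts the matching blocks. A <pre> with repeated spaces is simply not matched, so it stays in the output as it was and would still need a separate removal step to reach the expected result from the question.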

As others have said, don't use regex to parse general HTML content, or anything else that is not a regular language.
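
To give an idea of what the parser route could look like, here is a rough sketch that uses only System.Xml.Linq from the standard library. It rests on several assumptions that do not come from this thread: the fragment must be well-formed XML (real-world HTML often is not, in which case a dedicated HTML parser is the better choice), "multiple spaces" is read as two or more whitespace characters in a row, and the names PreConverter and Convert are made up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper; assumes the input fragment is well-formed XML.
public static class PreConverter
{
    public static string Convert(string fragment)
    {
        // Wrap the fragment in a synthetic root so it parses as a single element.
        XElement root = XElement.Parse("<root>" + fragment + "</root>", LoadOptions.PreserveWhitespace);

        // ToList() snapshots the matches so the tree can be modified inside the loop.
        foreach (XElement pre in root.Descendants("pre").ToList())
        {
            if (Regex.IsMatch(pre.Value, @"\s{2,}"))
                pre.Remove();     // text contains repeated whitespace: drop the element
            else
                pre.Name = "p";   // rename the element, keeping its children
        }

        // Serialize the children only, leaving out the synthetic root.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Because the tree is edited by element name rather than by text pattern, nested markup such as <b> inside a <pre> needs no special handling. One caveat: the XML writer may normalize line endings inside text nodes, so newlines between elements can come out in the platform's format.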
