
I am trying to replace a string in HTML. I am only interested in "true" text (textContent). That is, no attributes should be touched, just the text.

I came up with an expression that is not perfect yet:

var hayStack = @"<p class='33333'> 33333 <a href='33333'> 33333 </a> After 33333 </p> <div id='33333'></div>";
string pattern = @"(?x)(?<=>.*?) 33333 (?=.*?<)";
Console.WriteLine(Regex.Replace(hayStack, pattern, "Replaced")); 

That prints:

<p class='33333'> Replaced <a href='Replaced'> Replaced </a> After Replaced </p> <div id='Replaced'></div>

The expression handles text content correctly, but it also replaces the values of the href and id attributes, which should be left alone.
It should print:

<p class='33333'> Replaced <a href='33333'> Replaced </a> After Replaced </p> <div id='33333'></div>

How would the correct expression look?

GSerg
Robert Segdewick
  • Use HTMLAgilityPack. Don't use regex to parse HTML. – Nov 18 '19 at 17:38
  • I imagine creating a full-fledged node tree is much more expensive than a regex search. I really need the replacement to be fast, because it happens very often. – Robert Segdewick Nov 18 '19 at 17:41
  • Please see https://stackoverflow.com/a/1732454/11683. – GSerg Nov 18 '19 at 17:41
  • @Gserg What have I done... – Robert Segdewick Nov 18 '19 at 17:52
  • Can you get the regex to work? Probably, but (1) in a few months, you won't be able to understand the regex, and (2) you'll likely run into input that breaks the regex. Depending on how much variation exists in the input, you might not ever get all the bugs worked out. –  Nov 18 '19 at 17:59
  • Performance is measured with a profiler, not by what you “imagine”. Regular expression matches, especially those using backtracking as is commonly needed to match nested expressions, are also notoriously slow. Get your code working first using the HtmlAgilityPack. – Dour High Arch Nov 18 '19 at 18:19
  • `.*?` matches any chars, 0 or more. You need `[^<>]*?` – Wiktor Stribiżew Nov 18 '19 at 21:47
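
The last comment's suggestion can be sketched as follows. Replacing `.*?` with `[^<>]*` in both lookarounds means the match must sit between a `>` and a `<` with no other angle bracket in between, which excludes attribute values in the sample input. This is only a minimal sketch: the helper name `ReplaceInText` is hypothetical, and it assumes that attribute values never contain `<` or `>`, that the text to replace is always enclosed in an element, and that there are no comments or script/style blocks. For arbitrary HTML, a parser such as the HtmlAgilityPack recommended above is the more robust route.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextOnlyReplace
{
    // Hypothetical helper (not part of the original question): replaces
    // `needle` only where it sits between a '>' and a '<' with no other
    // angle bracket in between, i.e. inside text content.
    public static string ReplaceInText(string html, string needle, string replacement)
    {
        // .NET supports variable-length lookbehind, so [^<>]* is allowed here.
        string pattern = "(?<=>[^<>]*)" + Regex.Escape(needle) + "(?=[^<>]*<)";

        // A MatchEvaluator inserts the replacement literally, so '$' in it
        // is not interpreted as a substitution token.
        return Regex.Replace(html, pattern, _ => replacement);
    }

    public static void Main()
    {
        var hayStack = @"<p class='33333'> 33333 <a href='33333'> 33333 </a> After 33333 </p> <div id='33333'></div>";
        Console.WriteLine(ReplaceInText(hayStack, "33333", "Replaced"));
        // Prints:
        // <p class='33333'> Replaced <a href='33333'> Replaced </a> After Replaced </p> <div id='33333'></div>
    }
}
```

The original pattern failed because `.*?` may cross tag boundaries: for `href='33333'`, the lookbehind found an earlier `>` and skipped over `<a href='` to reach the match. With `[^<>]*` that crossing is impossible.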
