
I came across some weird behavior with Regex replacement in .NET. The regex is meant to strip all HTML tags from a string, leaving only the <br> tags. Below is the HTML I tried it with.

<h3 class="a-spacing-mini" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; text-rendering: optimizelegibility; font-size: 17px; line-height: 1.255; font-family: Arial, sans-serif; color: rgb(17, 17, 17); background-color: rgb(255, 255, 255); margin-bottom: 6px !important;">Support Healthy Memory &amp; Hormone Balance*</h3><h5 class="a-spacing-mini a-color-secondary" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; font-size: 13px; line-height: 19px; font-family: &quot;Amazon Ember&quot;, Arial, sans-serif; background-color: rgb(255, 255, 255); margin-bottom: 6px !important; color: rgb(136, 136, 136) !important;">Fight pregnenolone decline</h5><p class="a-spacing-base" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; color: rgb(17, 17, 17); font-family: &quot;Amazon Ember&quot;, Arial, sans-serif; font-size: 13px; background-color: rgb(255, 255, 255); margin-bottom: 14px !important;">Pregnenolone is made from cholesterol in the mitochondria of the adrenal glands and nervous system. Since its levels decline with age, supplementation is recommended.</p><h5 class="a-spacing-mini a-color-secondary" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; font-size: 13px; line-height: 19px; font-family: &quot;Amazon Ember&quot;, Arial, sans-serif; background-color: rgb(255, 255, 255); margin-bottom: 6px !important; color: rgb(136, 136, 136) !important;">The 'mother hormone'</h5><p class="a-spacing-base" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; color: rgb(17, 17, 17); font-family: &quot;Amazon Ember&quot;, Arial, sans-serif; font-size: 13px; background-color: rgb(255, 255, 255); margin-bottom: 14px

Below is the VB.NET code I used to strip all tags except <br>:

    Public Shared Function StripHTMLTagsAllowBreakOnly(ByVal text As String) As String
        Dim AcceptableTags As String = "br"
        Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
        text = Regex.Replace(text, WhiteListPattern, "", RegexOptions.Compiled)
        Return text
    End Function

Interestingly, when this method runs against the HTML above, the call to Regex.Replace never returns.

Does anyone have an idea why? Is it this specific expression, or something else?

I know the HTML is incomplete and is missing some closing tags; however, I never expected the regex to hang forever. (On a live server it hung for over 24 hours and I had to kill the process.)

Thank you

Sameers Javed
  • Looks like you may fix it like `"(?>?)(?!" & AcceptableTags & "\b)[a-zA-Z0-9]+(?:\s+[a-zA-Z0-9-]+(?:=(?:([""']?).*?\1?)?))*\s*/?>"` – Wiktor Stribiżew Sep 12 '19 at 08:18
  • I would recommend HtmlAgilityPack library for any work that involves parsing HTML. Getting Regex to work with parsing HTML is painful and error-prone. – Ghasan غسان Sep 12 '19 at 09:03
  • Did someone say [Parse HTML with Regex](https://stackoverflow.com/a/1732454/463623)? Because I heard someone say Parse HTML with Regex! – Euphoric Sep 12 '19 at 09:04

1 Answer

It seems to work with:

    Dim AcceptableTags As String = "br"
    Dim WhiteListPattern As String = ("<(?!" + (AcceptableTags + ")[^>]*(>|$)"))
    text = Regex.Replace(text, WhiteListPattern, "", RegexOptions.Compiled)

However, the regex should NOT go into an infinite loop (I guess it does, which is why it runs forever).

It would make more sense to throw an error or something; it doesn't make sense to get stuck forever.
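
On the "throw an error" point: .NET Framework 4.5 and later accept a match timeout on Regex.Replace, and the call throws RegexMatchTimeoutException when the limit is exceeded. The hang is most likely excessive backtracking on the truncated input rather than a literal infinite loop, though I have not profiled it. Below is a minimal sketch, assuming a 2-second budget; the helper name StripTagsWithTimeout and the fallback of returning the input unchanged are illustrative choices, not something from the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexTimeoutSketch
    ' Hypothetical helper: runs the replacement with a match timeout so a
    ' runaway pattern fails with an exception instead of hanging.
    Public Function StripTagsWithTimeout(ByVal text As String, ByVal pattern As String) As String
        Try
            ' The overload taking a TimeSpan requires .NET Framework 4.5 or later.
            Return Regex.Replace(text, pattern, "", RegexOptions.None, TimeSpan.FromSeconds(2))
        Catch ex As RegexMatchTimeoutException
            ' Assumption: returning the unmodified input is an acceptable fallback.
            Console.Error.WriteLine("Regex timed out after " & ex.MatchTimeout.ToString())
            Return text
        End Try
    End Function

    Sub Main()
        ' Same pattern as above, with "br" as the only whitelisted tag.
        Dim pattern As String = "<(?!br)[^>]*(>|$)"
        Console.WriteLine(StripTagsWithTimeout("<p>one<br>two</p>", pattern))
        ' Prints: one<br>two
    End Sub
End Module
```

With the timeout in place, a pathological input costs at most the configured budget per call, and the exception can be logged or rethrown as the application sees fit.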

Thank you guys

csabinho
Sameers Javed