I came across a strange behavior of Regex replacement in .NET.
The regex is meant to strip HTML tags from a string, leaving only the <br>
tags. Below is the HTML I tried it with.
<h3 class="a-spacing-mini" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; text-rendering: optimizelegibility; font-size: 17px; line-height: 1.255; font-family: Arial, sans-serif; color: rgb(17, 17, 17); background-color: rgb(255, 255, 255); margin-bottom: 6px !important;">Support Healthy Memory & Hormone Balance*</h3><h5 class="a-spacing-mini a-color-secondary" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; font-size: 13px; line-height: 19px; font-family: "Amazon Ember", Arial, sans-serif; background-color: rgb(255, 255, 255); margin-bottom: 6px !important; color: rgb(136, 136, 136) !important;">Fight pregnenolone decline</h5><p class="a-spacing-base" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; color: rgb(17, 17, 17); font-family: "Amazon Ember", Arial, sans-serif; font-size: 13px; background-color: rgb(255, 255, 255); margin-bottom: 14px !important;">Pregnenolone is made from cholesterol in the mitochondria of the adrenal glands and nervous system. Since its levels decline with age, supplementation is recommended.</p><h5 class="a-spacing-mini a-color-secondary" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; font-size: 13px; line-height: 19px; font-family: "Amazon Ember", Arial, sans-serif; background-color: rgb(255, 255, 255); margin-bottom: 6px !important; color: rgb(136, 136, 136) !important;">The 'mother hormone'</h5><p class="a-spacing-base" style="box-sizing: border-box; margin-top: 0px; margin-right: 0px; margin-left: 0px; padding: 0px; text-size-adjust: 100%; color: rgb(17, 17, 17); font-family: "Amazon Ember", Arial, sans-serif; font-size: 13px; background-color: rgb(255, 255, 255); margin-bottom: 14px
Below is the VB.NET code I used to strip all tags except <br>:
    Public Shared Function StripHTMLTagsAllowBreakOnly(ByVal text As String) As String
        Dim AcceptableTags As String = "br"
        Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
        text = Regex.Replace(text, WhiteListPattern, "", RegexOptions.Compiled)
        Return text
    End Function
Interestingly, when this method is executed with the HTML above, the call to Regex.Replace
never returns.
Does anyone have an idea what causes this? Is it this specific expression, or something else?
I know the HTML is incomplete and is missing some closing tags; however, I never expected the regex to hang forever. (On a live server it hung for over 24 hours, and I had to kill the process.)
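As a stopgap against the hang, one thing that could be tried is the Regex.Replace overload that accepts a match timeout (available from .NET Framework 4.5 onward, as far as I know). This is only a sketch: the module name, the function name StripTagsWithTimeout, and the choice to return the input unchanged on timeout are my own assumptions for illustration, and it does not fix whatever makes the pattern run so long.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module RegexTimeoutSketch
    ' Hypothetical helper: same pattern as in the question, but the regex
    ' engine gives up once a match attempt exceeds the given timeout.
    Public Function StripTagsWithTimeout(ByVal text As String, ByVal timeout As TimeSpan) As String
        Dim AcceptableTags As String = "br"
        Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
        Try
            ' Overload: Replace(input, pattern, replacement, options, matchTimeout)
            Return Regex.Replace(text, WhiteListPattern, "", RegexOptions.None, timeout)
        Catch ex As RegexMatchTimeoutException
            ' Assumption: on timeout, hand back the original text untouched
            ' so the caller can log it or fall back to another approach.
            Return text
        End Try
    End Function
End Module
```

It could be called as StripTagsWithTimeout(html, TimeSpan.FromSeconds(2)); the two-second value is arbitrary. With this in place the process would at least not be stuck for hours, although the tags would remain unstripped whenever the timeout fires.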
Thank you.