There are a number of web links on the page I am parsing. I want to capture all h3 links except those with specific sub-elements in them.
Example page:
<h3 class="r">
<a href="http://Capture This"></a>
</h3>
<some tags here>
<more tags here>
<a bit more tags here>
</div>
</div>
<h3 class="r">
<a href="http://Capture This"></a>
</h3>
<some tags here>
<class=ml>
<more tags here>
<class=tcl>
<a bit more tags here>
</div>
</div>
<h3 class="r">
<a href="http://Dont capture this"></a>
</h3>
<some tags here>
<class=ml>
<more tags here>
<a bit more tags here>
</div>
</div>
Capture h3 links:
- Which do not contain a class=ml sub-element
- Or which contain both class=ml and class=tcl sub-elements (in that order)
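To make the rule concrete, here is a minimal sketch of it written without one big regex. It is in Python purely for illustration (my actual code runs on .NET), and the sample markup, the HREF pattern and the wanted_links helper are all made up for this example. It also assumes that blocks are separated by a literal </div></div> with nothing in between.

```python
import re

# Hypothetical compacted sample: three blocks, each closed by a double </div>.
SAMPLE = (
    '<h3 class="r"><a href="http://capture-1"></a></h3><p>x</p></div></div>'
    '<h3 class="r"><a href="http://capture-2"></a></h3>'
    '<b class=ml><i class=tcl></div></div>'
    '<h3 class="r"><a href="http://skip-3"></a></h3>'
    '<b class=ml><p>y</p></div></div>'
)

# Finds the h3 link at the start of a block; group 1 is the URL.
HREF = re.compile(r'''<h3 class=["']?[^"']+["']?>\s*<a href=["']?(https?://[^"']+)''')


def wanted_links(html):
    """Return the h3 link URLs whose block passes the rule."""
    links = []
    # Assumption: a literal "</div></div>" ends every block.
    for block in html.split("</div></div>"):
        m = HREF.search(block)
        if not m:
            continue
        tail = block[m.end():]
        ml = tail.find("class=ml")
        # Keep the link if there is no class=ml at all,
        # or if class=tcl shows up somewhere after class=ml.
        if ml == -1 or tail.find("class=tcl", ml) != -1:
            links.append(m.group(1))
    return links


print(wanted_links(SAMPLE))  # ['http://capture-1', 'http://capture-2']
```

The first two links are kept and the third is dropped, which is the behaviour I want from the regex.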
This regex matches all h3 links:
h3 class=["']?[^"']+["']?><a href=["']?(https?://[^"']+)["']?
This regex matches all h3 links without class=ml in their sub-elements (each h3 element is separated by a double </div> tag):
h3 class=["']?r["']?><a href=["']?(https?://[^"']+)["']?(?=((?!class=ml).)*(</div>){2,})
Finally, the regex I am looking for captures all h3 links that either do not contain class=ml, or contain both class=ml and class=tcl in their sub-elements (in this order):
h3 class=["']?[^"']+["']?><a href=["']?(https?://[^"']+)["']?(?=((?!class=ml)(?!(</div>){2,}).)*(class=ml((?!class=tcl>).)*class=tcl>|(</div>){2,}))
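As a quick sanity check of what this regex captures, here is a small sketch that runs the same pattern over a compacted sample. It uses Python's re module as a stand-in for the .NET engine, so it only shows what gets matched and says nothing about .NET timings. The PAGE string and the variable names are made up for the example, and re.S is my assumption for the equivalent of RegexOptions.Singleline.

```python
import re

# The pattern from above, split across lines only for readability.
FINAL = re.compile(
    r'''h3 class=["']?[^"']+["']?><a href=["']?(https?://[^"']+)["']?'''
    r'''(?=((?!class=ml)(?!(</div>){2,}).)*'''
    r'''(class=ml((?!class=tcl>).)*class=tcl>|(</div>){2,}))''',
    re.S,  # let "." cross newlines, like Singleline in .NET (assumption)
)

# Hypothetical compacted page: no whitespace between tags, because the
# pattern expects "><a href" and "</div></div>" to be directly adjacent.
PAGE = (
    '<h3 class="r"><a href="http://capture-1"></a></h3><p>x</p></div></div>'
    '<h3 class="r"><a href="http://capture-2"></a></h3>'
    '<b class=ml><i class=tcl></div></div>'
    '<h3 class="r"><a href="http://skip-3"></a></h3>'
    '<b class=ml><p>y</p></div></div>'
)

# Group 1 holds the URL of each link the pattern accepts.
captured = [m.group(1) for m in FINAL.finditer(PAGE)]
print(captured)  # ['http://capture-1', 'http://capture-2']
```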
The regex I came up with works, but it is very inefficient due to backtracking. For example, 100 iterations of matching a typical page take 50 seconds to complete. Is there any way to improve this regex so that it doesn't backtrack so much?
I believe I am starting to understand what parsing HTML the Cthulhu way actually means, but hopefully it won't disturb anyone's sleep.
P.S. I am on the .NET regex engine, if that influences my options.