
There are a number of web links on the page I am parsing. I want to capture all h3 links except those with specific sub-elements in them.

Example page:

<h3 class="r">
    <a href="http://Capture This"></a>
</h3>
   <some tags here>
   <more tags  here>
   <a bit more tags here>
   </div>
</div>
<h3 class="r">
   <a href="http://Capture This"></a>
</h3>
<some tags here>
  <class=ml>
  <more tags here>
  <class=tcl>
  <a bit more tags here>
  </div>
</div>
<h3 class="r">
  <a href="http://Dont capture this"></a>
</h3>
  <some tags here>
  <class=ml>
  <more tags here>
  <a bit more tags here>
  </div>
</div>

Capture h3 links:

  • Which do not contain a class=ml sub-element
  • Or which contain both class=ml and class=tcl sub-elements
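As a reference point for checking any regex against these two rules, here is a minimal Python sketch of the same filter written as plain code. It assumes that everything between one h3 link and the next belongs to that link's block. The sample markup, `PAGE`, `H3_LINK`, and `wanted_links` are all hypothetical names invented for illustration, and real pages will differ.

```python
import re

# Hypothetical sample that mirrors the example page; real markup will differ.
PAGE = (
    '<h3 class="r"><a href="http://example.com/plain"></a></h3>'
    '<div><div><span class=st></span></div></div>'
    '<h3 class="r"><a href="http://example.com/ml-only"></a></h3>'
    '<div><div><span class=ml></span><span class=st></span></div></div>'
    '<h3 class="r"><a href="http://example.com/ml-and-tcl"></a></h3>'
    '<div><div><span class=ml></span><span class=tcl></span></div></div>'
)

# Matches every h3 link; optional whitespace is allowed before the <a> tag.
H3_LINK = re.compile(
    r'<h3 class=["\']?[^"\'>]+["\']?>\s*<a href=["\']?(https?://[^"\'>]+)'
)

def wanted_links(page):
    """Return h3 links whose block has no class=ml, or has class=ml followed by class=tcl."""
    links = []
    matches = list(H3_LINK.finditer(page))
    for i, m in enumerate(matches):
        # Assumption: a block runs from this link up to the next h3 link.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page)
        block = page[m.end():end]
        ml = block.find('class=ml')
        if ml == -1 or block.find('class=tcl', ml) != -1:
            links.append(m.group(1))
    return links

print(wanted_links(PAGE))
# prints ['http://example.com/plain', 'http://example.com/ml-and-tcl']
```

Each block is inspected exactly once, so the cost grows linearly with page size. That makes it a useful baseline when timing a single-regex solution.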

This regex matches all h3 links:

h3 class=["']?[^"']+["']?><a href=["']?(https?://[^"']+)["']?

This regex matches all h3 links without class=ml in their sub-elements (each h3 block ends with a double </div> tag):

h3 class=["']?r["']?><a href=["']?(https?://[^"']+)["']?(?=((?!class=ml).)*(</div>){2,})

Finally, the regex I am looking for captures all h3 links that either do not contain class=ml, or contain both class=ml and class=tcl in their sub-elements (in this order):

h3 class=["']?[^"']+["']?><a href=["']?(https?://[^"']+)["']?(?=((?!class=ml)(?!(</div>){2,}).)*(class=ml((?!class=tcl>).)*class=tcl>|(</div>){2,}))

The regex I came up with works, but very inefficiently due to backtracking. For example, 100 iterations of matching a standard page take 50 seconds to complete. Is there any way to improve this regex so that it doesn't backtrack so much?
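One possible direction, sketched here in Python because its `re` module accepts the same lookahead syntax, is to keep a single regex but tighten the lookahead in three ways. In the pattern above, the scan after class=ml does not appear to be bounded by the double </div>, so when `.` matches newlines it can run into later blocks. The sketch bounds both scans by the block terminator, uses non-capturing groups so no per-character captures are recorded, and makes the two alternatives mutually exclusive. The sample markup and the names `PAGE`, `FILTERED`, and `filtered_links` are hypothetical, and this has not been timed on the .NET engine.

```python
import re

# Hypothetical sample that mirrors the example page; real markup will differ.
PAGE = (
    '<h3 class="r"><a href="http://example.com/plain"></a></h3>'
    '<div><div><span class=st></span></div></div>'
    '<h3 class="r"><a href="http://example.com/ml-only"></a></h3>'
    '<div><div><span class=ml></span><span class=st></span></div></div>'
    '<h3 class="r"><a href="http://example.com/ml-and-tcl"></a></h3>'
    '<div><div><span class=ml></span><span class=tcl></span></div></div>'
)

FILTERED = re.compile(r'''
    <h3\ class=["']?[^"'>]+["']?>\s*
    <a\ href=["']?(https?://[^"'>]+)
    (?=
        # Scan the block once, stopping at class=ml or at the closing double div.
        (?:(?!class=ml|</div>\s*</div>).)*
        (?:
            </div>\s*</div>                        # block ended without class=ml
          |
            class=ml
            (?:(?!class=tcl|</div>\s*</div>).)*    # stay inside the same block
            class=tcl
        )
    )
''', re.VERBOSE | re.DOTALL)

def filtered_links(page):
    """Return the captured URL of every h3 link that passes the lookahead."""
    return [m.group(1) for m in FILTERED.finditer(page)]

print(filtered_links(PAGE))
# prints ['http://example.com/plain', 'http://example.com/ml-and-tcl']
```

With the second scan bounded, a block that has class=ml but no class=tcl is rejected after looking at that block only, even when a later block contains class=tcl. On .NET the same pattern would need RegexOptions.Singleline for `.` to cross line breaks. The engine's atomic groups, `(?>...)`, may cut backtracking further, but that is an untested assumption here.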

I believe I am starting to understand what parsing HTML the Cthulhu way actually means, but hopefully it won't disturb anyone's sleep.

P.S. I am on the .NET regex engine, if that influences my options.

Civa
d3delux
  • I would recommend to simply get a list of all the h3 links and the class attributes of their inner tags and then you can filter them to get your desired result. – Matthias Apr 14 '13 at 23:00
  • Good suggestion, and the most logical one. However, I am restricted to using only regex. – d3delux Apr 17 '13 at 16:34
  • Throw away your regular expressions and download the HTML Agility Pack. Life is way too short to try any kind of serious HTML parsing with regular expressions. – Jim Mischel Apr 17 '13 at 16:36
  • I'm not sure why you are restricted to using regex only, but that's not a great choice for parsing HTML. You can easily traverse the DOM with .net, which is much less complicated and will perform much better. – Brian Stephens May 02 '13 at 16:23
  • http://stackoverflow.com/a/1732454/213550 – VMAtm Jul 17 '14 at 13:24

0 Answers