
I have been using regexpal to test my regular expressions, and can't understand why the one I'm testing now is failing.

I've consulted several regex tutorials and references, and still don't see anything that would explain why I'm encountering these problems.

The regex I'm testing is:

(<p>\s*(?:(?:<font[^>]*>)*?(?:<a[^>]*>)*?(?:<strong[^>]*>)*?(?:</font>)*?(?:</a>)*?(?:</strong>)*?[^<^>]*)*</p>)?\s*<ul>(.*?)</ul>

The data that works is:

<p><font size="1" face="Verdana, Arial, Helvetica, sans-serif"><a href="#test1">test1</a> | <a href="#test2">test12</a></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif"><font size="2"><font face="Verdana, Arial, Helvetica, sans-serif"><font size="2"><strong>Production </strong><a name="prodSupport"></a></font></font></font></font><font face="Verdana, Arial, Helvetica, sans-serif"><strong><font size="2">stuff</font></strong> </font><a name="art"></a></p>
            <ul>
                <li><span style="font-family: Arial"><font size="1"><a id="Assistants" href="Assistants.aspx" name="Assistants">Assistants</a></font></span><font size="1"><a id="Assistants" href="Assistants.aspx" name="Assistants"></a></font></li>
</ul>

And the data that doesn't work is:

<p><font size="1" face="Verdana, Arial, Helvetica, sans-serif"><a href="#test1">test1</a> | <a href="#test2">test123</a></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif"><font size="2"><font face="Verdana, Arial, Helvetica, sans-serif"><font size="2"><strong>Production </strong><a name="prodSupport"></a></font></font></font></font><font face="Verdana, Arial, Helvetica, sans-serif"><strong><font size="2">stuff</font></strong> </font><a name="art"></a></p>
            <ul>
                <li><span style="font-family: Arial"><font size="1"><a id="Assistants" href="Assistants.aspx" name="Assistants">Assistants</a></font></span><font size="1"><a id="Assistants" href="Assistants.aspx" name="Assistants"></a></font></li>
</ul>

Why would "test12" work and "test123" not? I'm thoroughly confused.

  • You might want to post what you are trying to transform (what's the starting text and desired result)? It's easier to write a regex from scratch than pinpoint a problem with a long one like that. :) – Kevin Seifert Dec 21 '13 at 01:46
  • The starting point is: http://www.coj.net/departments/office-of-economic-development/film-and-television/production-guide/production-guide-listings.aspx#prodSupport and the desired result is to capture the sometimes present category headings (e.g. "Production Art/Props"). The regex is run against the results of the following regex: \s*(?:.*?)(?:]*>)?(.*?)(?:)?(?:.*?)(.*?) – user3085196 Dec 21 '13 at 01:51
  • It's been asked before, but why aren't you parsing this with a proper HTML/DOM parser? – brandonscript Dec 21 '13 at 01:53
  • Regexes are more reliable over time than treating it as an XML document and using LINQ. If they add a menu or banner or anything the XML solution breaks whereas the regex search will work until they fundamentally change the format of the data I'm after. – user3085196 Dec 21 '13 at 01:59
  • Neither of those strings match when I try your regex. – Vasili Syrakis Dec 21 '13 at 02:01
  • You have to enable "dot matches all" – user3085196 Dec 21 '13 at 02:04
  • @user3085196 Using a [proper *HTML* parsing library](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) is more reliable over time than treating it as dumb text and using regexes. Some libraries allow CSS selectors or XPath-like queries - used as roots for LINQ operations, these are *very* powerful ways to navigate HTML. If the site changes so fundamentally much that it must be rewritten for one approach then it must be rewritten for the other anyway: however, unlike complicated minutia regular expressions, higher-order extractors will be maintainable. – user2864740 Dec 21 '13 at 19:05

2 Answers


I'd avoid scraping someone's site if at all possible (ideally you want to pull a data feed).

Otherwise, if you are just pulling links from: http://www.coj.net/departments/office-of-economic-development/film-and-television/production-guide/production-guide-listings.aspx#prodSupport

... I'd scrape only the <strong> (or whichever single tag you are interested in). If you end up with a little junk, just manually remove the data you don't want. A complex regex will be very brittle and will break when they update the CSS or slightly tweak the page layout.

Kevin Seifert
  • Regardless of whether it's a good idea or not, I'd like to learn the regex mechanics that make one match and the other not. – user3085196 Dec 21 '13 at 02:21
  • But it does not look like the site uses consistent HTML formatting. Some headers are larger than others. If they are editing this page by hand, there may not even be a pattern to match. – Kevin Seifert Dec 21 '13 at 02:25
  • If the regex would work consistently, it would match everything on the site currently. It looks like I'm going to have to enumerate the tags in each table row, then with a dom parser find the element before the – user3085196 Dec 21 '13 at 02:43
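
A possible explanation for the test12/test123 difference (a hypothesis, not something confirmed in this thread): in the original pattern the text class `[^<^>]*` sits inside a group that is itself repeated with `*`, so a run of n tag-free characters can be divided between iterations in 2^(n-1) ways. When `\s*<ul>` fails after the first `</p>`, a backtracking engine may retry all of those divisions, and a tester that caps backtracking work could give up once one extra character doubles the count. The Python sketch below only counts the divisions; `count_splits` is a made-up helper name, and it does not run the regex itself.

```python
from functools import lru_cache


def count_splits(run: str) -> int:
    """Count the ways a tag-free text run can be cut into consecutive,
    non-empty pieces, i.e. the ways a nested quantifier such as
    (?:[^<>]*)* could distribute the run across its iterations."""

    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        if start == len(run):
            return 1  # nothing left to divide: one valid way
        # The next piece may end at any later position.
        return sum(ways(end) for end in range(start + 1, len(run) + 1))

    return ways(0)


print(count_splits("test12"))   # 32
print(count_splits("test123"))  # 64
```

The other text runs in the first paragraph multiply this figure further, so the total work roughly doubles with each added character. Whether regexpal actually hit such a limit here is an assumption.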

I've moved to using the slightly more error-prone:

(<p>(?:(?!</p>).)*</p>)?\s*<ul>(.*?)</ul>

It lets me continue the job, at least.
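
As a rough check of the pattern above, here is a minimal sketch in Python's `re` module (an assumption; the thread itself uses .NET's `Regex`). The `sample` string is a shortened, made-up stand-in for the page markup, and `re.DOTALL` plays the role of the "dot matches all" option mentioned in the comments.

```python
import re

# Same pattern as above; DOTALL lets "." cross line breaks.
PATTERN = re.compile(r"(<p>(?:(?!</p>).)*</p>)?\s*<ul>(.*?)</ul>", re.DOTALL)

# Shortened, hypothetical stand-in for the real page markup.
sample = (
    '<p><a href="#test1">test1</a> | <a href="#test2">test123</a></p>\n'
    '<p><font size="2"><strong>Production </strong></font><strong>stuff</strong></p>\n'
    '    <ul>\n'
    '        <li><a href="Assistants.aspx">Assistants</a></li>\n'
    '</ul>'
)

match = PATTERN.search(sample)
heading_html = match.group(1)  # the <p>...</p> directly before <ul>, if any
list_html = match.group(2)     # everything between <ul> and </ul>

print(heading_html)
print(list_html.strip())
```

Because `(?:(?!</p>).)*` can only stop at the first `</p>` it meets, there is a single way to consume each paragraph, so the length of "test123" should not change the outcome.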

I check the first capture group to see if it's an empty string, and if not I run:

input = Regex.Replace(input, "<[^>]*>", "")

to strip the tags and leave me with the category text. Quick, efficient, even if it's a little dirty.
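
The same empty-check and tag-stripping step might look like this in Python (again an assumption; the line above is .NET). `category_from` is a hypothetical helper name. Note one difference: Python reports an unmatched optional group as `None`, whereas the answer above checks for an empty string.

```python
import re
from typing import Optional


def category_from(heading_html: Optional[str]) -> Optional[str]:
    """Return the plain-text category, or None when no heading was captured."""
    if not heading_html:  # covers both None and the empty string
        return None
    # Same idea as Regex.Replace(input, "<[^>]*>", ""): drop every tag.
    return re.sub(r"<[^>]*>", "", heading_html).strip()


heading = '<p><font size="2"><strong>Production </strong></font><strong>stuff</strong> </p>'
print(category_from(heading))  # Production stuff
print(category_from(None))     # None
```

This only removes tags; HTML entities such as `&amp;` would be left as they are.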