
I've been struggling for over a day with what I thought would be an easy thing.

I need to parse a page's HTML to find some structured data.

Here's the test string:

<option value="0794">0794 - SANTA MARIA</option>
<option value="0795">0795 - ALICE COUTINHO</option>
<option value="0800">0800 - T.LARANJEIRAS (CIRCULAR A E B) - VIA T. CARAPINA/J. CAMBURI</option>
<option value="0801">0801 - T. LARANJEIRAS / T. CARAPINA - VIA VALPARAISO / J. LIMOEIRO</option>
<option value="0802">0802 - DIVINOPOLIS / T.LARANJEIRAS VIA CENTRO DA SERRA</option>

And here's the Regex pattern:

^\s+<option value="\d+">(?<linha>\d+) - (?<nome>(.*?))</option>$

When debugging with Visual Studio 2010, it finds no matches.

Full code:

var pattern = @"^\s+<option value=""\d+"">(?<linha>\d+) - (?<nome>(.*?))</option>$";
var regex = new Regex(pattern, RegexOptions.Multiline);
var matches = regex.Matches(html);

html is the test string and matches.Count is always 0.

I've already tested on http://regexhero.net/tester/ and on http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx and it works perfectly.

Any help would be appreciated.

Anderson Pimentel

2 Answers


Using the test string as it is presented here, it's clear that the problem is in the following part:

\s+

That means one or more whitespace characters, and the test string doesn't have any whitespace before any of the lines. \s* does the trick.
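A minimal sketch of the difference, assuming a single unindented line from the test string. The class `LeadingWhitespaceDemo` and the method `MatchesLine` are hypothetical names introduced only for illustration; the rest of the pattern is taken from the question.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original question.
public static class LeadingWhitespaceDemo
{
    // Builds the question's pattern with the given leading-whitespace token
    // (e.g. @"\s+" or @"\s*") and tests it against a single line.
    public static bool MatchesLine(string leadingToken, string line)
    {
        var pattern = "^" + leadingToken
            + @"<option value=""\d+"">(?<linha>\d+) - (?<nome>.*?)</option>$";
        return Regex.IsMatch(line, pattern);
    }
}
```

With the unindented line `<option value="0794">0794 - SANTA MARIA</option>`, `MatchesLine(@"\s+", line)` returns false and `MatchesLine(@"\s*", line)` returns true. Prefixing the line with four spaces makes both calls return true.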

Dmitry Polyanitsa
  • That was a copy+paste issue. The existing four spaces became the block indent for Stack Overflow and I forgot to add them again. :) The original string has four spaces at the beginning of each line. – Anderson Pimentel Jan 15 '12 at 13:16
2

I see two problems. First, there's the ^\s+ at the beginning of the regex. In Multiline mode, ^ matches the position following a linefeed. \s+ matches one or more whitespace characters. But there aren't any whitespace characters after the linefeeds. If you think there might be space or tab characters at the beginning of the line, you should change the + to *; otherwise, just drop the \s+.

Second, the regex ends with $, which matches just before a linefeed. But when I copied the text from your post, the lines ended with \r\n (carriage-return + linefeed), and you aren't accounting for the \r.

When I change the ^\s+ to ^ and the $ to \r?$, I get five matches. By the way, the second problem is .NET's fault, not yours; $ in multiline mode should match before \r, as detailed here.
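Both fixes can be combined in a rough sketch like the one below. It assumes the input may or may not be indented and may use either \n or \r\n line endings. The class `OptionParser` and the method `Parse` are hypothetical names, not anything from the original code; only the group names `linha` and `nome` come from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for illustration.
public static class OptionParser
{
    // \s* accepts indented and unindented lines.
    // \r?$ accepts CRLF endings: in Multiline mode, .NET's $ matches only
    // before \n or at the end of the input, so the \r must be consumed explicitly.
    private static readonly Regex OptionLine = new Regex(
        @"^\s*<option value=""\d+"">(?<linha>\d+) - (?<nome>.*?)</option>\r?$",
        RegexOptions.Multiline);

    // Returns (linha, nome) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in OptionLine.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["linha"].Value, m.Groups["nome"].Value));
        }
        return result;
    }
}
```

Calling `OptionParser.Parse(html)` on the five-line test string should then yield five pairs, whether the lines end in \n or \r\n.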

Alan Moore
  • The original string has four spaces at the beginning of each line. When I pasted it here, they became the block indentation syntax and I forgot to add them back. Anyway, the `\r?$` did the trick. I knew I was not (that) crazy. Thank you very much! – Anderson Pimentel Jan 15 '12 at 13:24