
I have this piece of HTML:

</TABLE>
<HR>
<font size="+1"> Method and apparatus for re-sizing and zooming images by operating directly
     on their digital transforms
</font><BR>

and I am trying to capture the text inside the font tag. This is my regex:

  Regex regex = new Regex("</TABLE><HR><font size=\"+1\">(?<title>.*?)</font><BR>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

        Match match = regex.Match(data);

        string title = match.Groups["title"].Value;

However, I get an empty title. Can anybody tell me what I am missing?

Alan Moore
Jack
  • A regex is the wrong tool for this. Regexes cannot parse HTML (or XML) with any degree of reliability. Use an HTML parser, and see [this question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Richard Aug 12 '12 at 11:38
  • @Richard: I understand this. However the website that I want to parse has a fixed structure and so I want to use Regex itself. – Jack Aug 12 '12 at 11:40

1 Answer


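Your regex:

new Regex("</TABLE><HR><font size=\"+1\">(?<title>.*?)</font><BR>"

doesn't match your input because + is a quantifier in regex. Here \"+1 means "one or more quote characters followed by 1", so the literal + in size="+1" is never matched.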

Based on your input string, what you really want is to escape it:

new Regex("</TABLE><HR><font size=\"\\+1\">(?<title>.*?)</font><BR>"
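
As a minimal sketch of the difference, the snippet below runs the unescaped and the escaped pattern against a one-line sample. The sample string and the class name are made up for illustration; Regex.Escape is shown as an alternative way to escape a literal fragment.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    static void Main()
    {
        // Hypothetical one-line sample, not the asker's real page.
        string sample = "<font size=\"+1\">Some title</font>";

        // Unescaped: "+ is read as "one or more quote characters",
        // so the literal + in the input is never matched.
        bool unescaped = Regex.IsMatch(sample, "<font size=\"+1\">");

        // Escaped: \+ matches a literal plus sign.
        bool escaped = Regex.IsMatch(sample, "<font size=\"\\+1\">");

        // Regex.Escape escapes the metacharacters of a literal fragment for you.
        bool viaEscape = Regex.IsMatch(sample, Regex.Escape("<font size=\"+1\">"));

        Console.WriteLine(unescaped); // False
        Console.WriteLine(escaped);   // True
        Console.WriteLine(viaEscape); // True
    }
}
```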

Also, your input has newlines between the tags, and the pattern has to consume them as well; RegexOptions.Singleline does not make them disappear. Adding a wildcard between the tags handles that, so this may be closer to what you're trying to do:

new Regex("</TABLE>.*<HR>.*<font size=\"\\+1\">(?<title>.*?)</font>.*<BR>"
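
Put together, a complete version might look like the sketch below. Two things here are assumptions and not part of the pattern above: it uses \s* between the tags instead of .* (so only whitespace is skipped, not arbitrary markup), and it collapses the line break inside the captured title into a single space. The class and method names are hypothetical, and the input is the snippet from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class TitleExtractor
{
    // \s* between tags is an assumption: a tighter stand-in for .* that
    // only tolerates whitespace (including newlines) between the tags.
    static readonly Regex TitleRegex = new Regex(
        "</TABLE>\\s*<HR>\\s*<font size=\"\\+1\">(?<title>.*?)</font>\\s*<BR>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string ExtractTitle(string data)
    {
        Match match = TitleRegex.Match(data);
        if (!match.Success)
        {
            return null; // no match: report it instead of returning ""
        }
        // Trim the capture and collapse runs of whitespace to one space.
        return Regex.Replace(match.Groups["title"].Value.Trim(), "\\s+", " ");
    }

    static void Main()
    {
        string data =
            "</TABLE>\n<HR>\n<font size=\"+1\"> Method and apparatus for re-sizing and zooming images by operating directly\n" +
            "     on their digital transforms\n</font><BR>";
        Console.WriteLine(ExtractTitle(data));
    }
}
```

Checking match.Success first makes a failed match visible; match.Groups["title"].Value on a failed match is simply an empty string, which is what the question ran into.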
Joachim Isaksson
  • Thanks. But I didn't understand why you used .* for the newlines. Wouldn't it match everything when RegexOptions.Singleline is set? – Jack Aug 12 '12 at 12:20
  • @Jack RegexOptions.Singleline only *changes the meaning of the dot (.) so it matches every character (instead of every character except \n).* In other words, the pattern still has to match each linefeed explicitly, for example with . or .*; the option does not skip it for you. – Joachim Isaksson Aug 12 '12 at 12:22
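
The point of the last comment can be checked with a small sketch. The test strings and the class name are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    static void Main()
    {
        // By default the dot matches any character except \n.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                          // False

        // With Singleline the dot matches \n as well.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline)); // True

        // Singleline does not skip newlines between literals:
        // the \n between the tags still has to be matched by something.
        Console.WriteLine(Regex.IsMatch("<HR>\n<BR>", "<HR><BR>", RegexOptions.Singleline));   // False
        Console.WriteLine(Regex.IsMatch("<HR>\n<BR>", "<HR>.*<BR>", RegexOptions.Singleline)); // True
    }
}
```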