I’m having a problem with regular expressions in C#. I have a string containing a page (HTML etc.), with \r\n, \r and \n scattered through it. I’m trying to match something in that string:

Match currentMatch = Regex.Match(contents, "Title: <strong>(.*?)</strong>");
string org = currentMatch.Groups[1].ToString();

This works fine. However, when the text I want to capture contains any of the line-break characters mentioned above, the match returns nothing (empty, no match):

Match currentMatch = Regex.Match(contents, "Description: <p>(.*?)</p>");
string org = currentMatch.Groups[1].ToString();

It does, however, work if I add the following lines above the match:

contents = contents.Replace("\r", " ");
contents = contents.Replace("\n", " ");

However, I don’t like that this modifies the source string. What can I do about this?

Johan Svensson

1 Answer

By default, the . matches any character except \n, so a lazy (.*?) cannot get past a line break. You can change this with the regex option Singleline. It treats the whole input string as one line, i.e. the dot then matches newline characters as well.

Match currentMatch = Regex.Match(contents, "Title: <strong>(.*?)</strong>", RegexOptions.Singleline);
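Applied to the Description pattern from the question, the difference could look roughly like the sketch below. The sample input, the class name SinglelineDemo and the helper ExtractDescription are made up for illustration; only Regex.Match and RegexOptions come from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: returns the text captured between <p> and </p>,
    // or an empty string when the pattern does not match.
    public static string ExtractDescription(string contents, RegexOptions options)
    {
        Match m = Regex.Match(contents, "Description: <p>(.*?)</p>", options);
        return m.Groups[1].Value;
    }

    public static void Main()
    {
        // Assumed sample input: a description that spans two lines.
        string contents = "Description: <p>first line\r\nsecond line</p>";

        // Without Singleline the dot stops at \n, so nothing is captured.
        string withoutOption = ExtractDescription(contents, RegexOptions.None);

        // With Singleline the dot also matches \n, so the whole paragraph is captured.
        string withOption = ExtractDescription(contents, RegexOptions.Singleline);

        Console.WriteLine(withoutOption.Length == 0);
        Console.WriteLine(withOption == "first line\r\nsecond line");

        // The source string is left untouched; no Replace calls are needed.
        Console.WriteLine(contents.Contains("\r\n"));
    }
}
```

If you would rather keep the option inside the pattern, .NET also accepts the inline form, e.g. "(?s)Description: <p>(.*?)</p>".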

By the way, I hope you are aware that regex is normally not the right tool for dealing with HTML?

stema
  • Hello, and thanks a lot. What is the better way to deal with HTML? I've always used regular expressions in other languages as well. Thanks – Johan Svensson Jan 22 '13 at 08:00
  • Use an HTML parser; see, for example, this question: [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – stema Jan 22 '13 at 08:06