4

I have the following regex in C#, and it causes the compiler error "Unrecognized escape sequence" on \w, \. and \/.

string reg = "<a href=\"[\w\.\/:]+\" target=\"_blank\">.?<img src=\"(?<imgurl>\w\.\/:])+\"";
Regex regex = new Regex(reg);

I also tried

string reg = @"<a href="[w./:]+" target=\"_blank\">.?<img src="(?<imgurl>w./:])+"";

But this way the string "ends" at the " character right after href=.

Can anyone help me please?

Crazywako
  • What would you suggest instead of using regex? There is not much of a choice when trying to get a large number of items from an HTML page. – Michael Hartmann Apr 25 '13 at 22:59
  • Use something that was designed to parse it, the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) is a common one. It will parse out all of the html tokens and let you take them apart and do whatever you need with them. – Scott Chamberlain Apr 26 '13 at 01:54
  • I am using HTML Agility pack but because the site I am parsing seems not to be dynamic, I think they make posts by hand. That means that sometimes the website structure changes. So I decided to try regex for those parts which I've seen that have been changing. – Crazywako Apr 26 '13 at 14:36
  • possible duplicate of [Unrecognized escape sequence for path string containing backslashes](http://stackoverflow.com/questions/1302864/unrecognized-escape-sequence-for-path-string-containing-backslashes) – JasonMArcher Feb 05 '15 at 00:03

4 Answers

11

Use "" to escape quotation marks inside an @ (verbatim) string literal.
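A minimal sketch of that rule, written as C# top-level statements (assumes C# 9 or later; the variable name `attr` is just for illustration):

```csharp
using System;

// In a verbatim (@) string, "" stands for one literal double quote.
string attr = @"target=""_blank""";

Console.WriteLine(attr);        // target="_blank"
Console.WriteLine(attr.Length); // 15
```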

Gjeltema
  • This was the right answer... Just couldn't rate it as solved because it has 15 minutes block to mark it. Marked! Thanks. – Crazywako Apr 26 '13 at 05:45
4

There are two escaping mechanisms at work here, and they interfere. For example, you use \" to tell C# to escape the following double quote, but you also use \w to tell the regular expression parser to treat the following w specially. But C# thinks \w is meant for C#, doesn't understand it, and you get a compiler error.

Take this example text:

<a href="file://C:\Test\Test2\[\w\.\/:]+">

There are two ways to escape it such that C# accepts it.

One way is to escape all characters that are special to C#. In this case the " is used to denote the end of the string, and \ denotes a C# escape sequence. Both need to be prefixed with a C# escape \ to escape them:

string s = "<a href=\"file://C:\\Test\\Test2\\[\\w\\.\\/:]+\">";

But this often leads to ugly strings, especially when used with paths or regular expressions.

The other way is to prefix the string with @ and escape only the " by replacing them with "":

string s = @"<a href=""file://C:\Test\Test2\[\w\.\/:]+"">";

The @ prevents C# from interpreting the \ in the string as the start of an escape sequence. Since \" is then no longer recognized either, a double quote inside such a string is written as "".
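As a quick check, the two literals above can be compared directly. This is a small sketch using C# top-level statements (C# 9 or later assumed); the variable names `regular` and `verbatim` are only for illustration:

```csharp
using System;

// Regular literal: \" and \\ are C# escape sequences.
string regular = "<a href=\"file://C:\\Test\\Test2\\[\\w\\.\\/:]+\">";

// Verbatim literal: backslashes are literal, "" is one double quote.
string verbatim = @"<a href=""file://C:\Test\Test2\[\w\.\/:]+"">";

Console.WriteLine(regular == verbatim); // True
Console.WriteLine(verbatim);            // prints the example text unchanged
```

Both forms produce the same string at run time; the difference is only in how the source code is written.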

Alan Moore
Daniel A.A. Pelsmaeker
3

Here's a better regex; yours has several problems:

string reg = @"<a href=""[\w./:]+"" target=""_blank"">.?<img src=""(?<imgurl>[\w./:]+)""";
Regex regex = new Regex(reg);

var m = regex.Match(@"<a href=""http://www.yahoo.com"" target=""_blank""><img src=""http://flickr.com/something.jpg""");

Catches <a href="http://www.yahoo.com" target="_blank"><img src="http://flickr.com/something.jpg". Problems with yours: forward slashes don't need to be escaped, the [ bracket is missing in the img part, and the ) that closes the group is in the wrong position.

However, as has been said many times, HTML is not structured enough to be caught by regex. But if you need to get something quick and dirty done, it will do.
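For the quick-and-dirty case, the captured URL can be read from the named group. A minimal sketch using C# top-level statements (C# 9 or later assumed); the `html` input is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var regex = new Regex(@"<a href=""[\w./:]+"" target=""_blank"">.?<img src=""(?<imgurl>[\w./:]+)""");

// Sample input, assumed for illustration only.
string html = @"<a href=""http://www.yahoo.com"" target=""_blank""><img src=""http://flickr.com/something.jpg"" /></a>";

Match m = regex.Match(html);
if (m.Success)
{
    // The named group (?<imgurl>...) holds the value of the src attribute.
    Console.WriteLine(m.Groups["imgurl"].Value); // http://flickr.com/something.jpg
}
```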

Shlomo
  • +1, but you still have several unnecessary backslashes there. In fact, the only ones you really need are in the two occurrences of `\w`. – Alan Moore Apr 26 '13 at 01:57
  • You're correct. Edited to remove most of them. I left in the one before the `.` because otherwise that will match any character instead of just periods, which would obviously be bad... – Shlomo Apr 26 '13 at 14:44
  • No, that one can go, too. Inside a character class, `.` just matches a dot. – Alan Moore Apr 26 '13 at 22:57
  • Didn't know that. Tested, and you're correct. Edited for correctness. – Shlomo Apr 29 '13 at 15:40
0

Here's the deal. C# string literals treat certain character combinations as escape sequences. You may be familiar with putting \n in a string to produce a newline character, for example. When you put a single \ in a regular string literal, the compiler tries to interpret it, together with the next character, as one of these escape sequences, and reports an error when the combination is not valid. Fortunately, that does not prevent you from using backslashes: one of those sequences, \\, is interpreted as a single backslash.

So, in practice, if you replace every backslash that is meant for the regex with a double backslash (leaving the \" escapes as they are), the string should compile.
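A small sketch of that approach with a regular (non-verbatim) literal, written as C# top-level statements (C# 9 or later assumed). The pattern is a shortened, corrected version of the img part of the question's regex, and the input string is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Regular literal: \\w reaches the regex engine as \w, and \" as a plain ".
string reg = "<img src=\"(?<imgurl>[\\w./:]+)\"";
var regex = new Regex(reg);

Console.WriteLine(reg); // <img src="(?<imgurl>[\w./:]+)"
Console.WriteLine(regex.IsMatch("<img src=\"http://flickr.com/something.jpg\"")); // True
```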

Alan Moore