2

I have data with several occurrences of the following string:

<a href="default.asp?itemID=987">

in which the itemID is always different. I am using C# and I want to get all those itemIDs with a Regular Expression.

At first I tried this

"<a href=\"default.asp?itemID=([0-9]*)\">"

But the question mark is a reserved character in regular expressions. I considered using the @ prefix (a verbatim string) to disable escape processing, but the pattern still contains double quotes that need escaping. So I went with

"<a href=\"default.asp\\?itemID=([0-9]*)\">"

which should be translated (as a string) to

<a href="default.asp\?itemID=([0-9]*)">

But Regex.Match reports no success. I tried the very same regex here and it worked. What am I doing wrong?

Johan
ckonig
  • 2
    "I want to get all those itemIDs with a Regular Expression." You shouldn't. Use HTMLAgilityPack instead. http://htmlagilitypack.codeplex.com/ – David Brabant May 30 '12 at 14:53
  • This never gets old: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Bridge May 30 '12 at 14:54
  • After all these discussions people still continue parsing HTML with Regex.. – Tigran May 30 '12 at 14:54
  • Those guys are right, you know. You shouldn't parse HTML with Regex. That said, I see no reason why one shouldn't *both* suggest another alternative and try to help you out. – GregRos May 30 '12 at 14:59
  • 1
    Wow, seems like I hit a sensitive nerve here. But I am not parsing some random HTML where everything can occur. I just need to replace some links that have been created by a WYSIWYG editor and are stored in a database. – ckonig May 31 '12 at 06:52

3 Answers

11

? and . are special characters in a regex, but they can't be escaped "as is" in a regular string literal, because the backslash is also the escape character of the string itself. A single \ in front of them is not a valid string escape, and with no backslash at all they keep their special meaning in the regex. So each one needs a doubled backslash (\\):

"<a href=\"default\\.asp\\?itemID=([0-9]*)\">";
Raphaël Althaus
7

When using the @ prefix (a verbatim string literal), you can write a double quote as "".

You also need to escape certain special characters in the regex, in this case . and ?

Try this:

@"<a href=""default\.asp\?itemID=([0-9]*)"">"
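
As a rough illustration of the point above, the sketch below checks that the verbatim literal and the regular literal with doubled backslashes produce the same pattern, then collects every itemID with Regex.Matches. The class name ItemIdExtractor, the helper ExtractItemIds and the sample HTML are made up for this example, and the quantifier is changed from * to + on the assumption that an empty itemID should not count as a match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ItemIdExtractor
{
    // Verbatim literal: "" is one double quote, backslashes reach the regex untouched.
    public const string VerbatimPattern = @"<a href=""default\.asp\?itemID=([0-9]+)"">";

    // Regular literal for the same pattern: \" is a quote, \\ is one backslash.
    public const string RegularPattern = "<a href=\"default\\.asp\\?itemID=([0-9]+)\">";

    // Hypothetical helper (not from the original post): returns every captured itemID.
    public static List<string> ExtractItemIds(string html)
    {
        var ids = new List<string>();
        foreach (Match m in Regex.Matches(html, VerbatimPattern))
        {
            ids.Add(m.Groups[1].Value); // group 1 holds the digits
        }
        return ids;
    }

    public static void Main()
    {
        // Sample input invented for the example.
        string html = "<a href=\"default.asp?itemID=987\">one</a> " +
                      "<a href=\"default.asp?itemID=12\">two</a>";

        Console.WriteLine(VerbatimPattern == RegularPattern);      // True
        Console.WriteLine(string.Join(",", ExtractItemIds(html))); // 987,12
    }
}
```

Regex.Match only returns the first hit, so Regex.Matches is the more natural call when all IDs are wanted.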
spender
  • One less than supposed. Because \ is the escape char, to get an actual backslash, you need \\. `.` and `?` both have special meaning and thus also need to be escaped. – spender May 30 '12 at 14:59
  • Still too many backslashes at the ? mark. This is a @ string ya know. Oh, I'm pretty sure you need to write `\.` in .NET regex, though I could be mistaken. – GregRos May 30 '12 at 15:01
  • So... in the middle of the string we're looking to **exactly** match \? right? How does that escape? to my mind (and that of my test), \ becomes \\ and ? becomes \? giving \\\? . Am I wrong? – spender May 30 '12 at 15:04
  • OP says the text is `` – GregRos May 30 '12 at 15:05
1

Try escaping the dot '.' character with \.
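
A small sketch of why the dot matters, and of one way to avoid hand-escaping: an unescaped . matches any character, while Regex.Escape adds the backslashes for the literal part of the pattern. The class name DotEscapeDemo and the helper BuildPattern are invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotEscapeDemo
{
    // Hypothetical helper: lets Regex.Escape handle the . and ? in the literal part.
    public static string BuildPattern()
    {
        string literalPart = Regex.Escape("default.asp?itemID=");
        return "<a href=\"" + literalPart + "([0-9]*)\">";
    }

    public static void Main()
    {
        // An unescaped dot matches any character, so "X" is accepted here.
        Console.WriteLine(Regex.IsMatch("defaultXasp", "default.asp"));   // True

        // An escaped dot only matches a literal '.'.
        Console.WriteLine(Regex.IsMatch("defaultXasp", @"default\.asp")); // False

        Console.WriteLine(BuildPattern()); // <a href="default\.asp\?itemID=([0-9]*)">
    }
}
```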

GregRos