I'm trying to match and replace anchor tags using a regex. What I have so far is this:

"(<a href=['\"]?([\\w_\\.]*)['\"]?)"

The problem with this approach is that it fails to capture hrefs that also have # in their value. I've tried

"(<a href=['\"]?([\\w_\\.#]*)['\"]?)"

and

"(<a href=['\"]?([\\w_\\.\\#]*)['\"]?)"

with no success.

What am I doing wrong?

Thank you

scripni
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Martijn Mar 23 '11 at 09:39

3 Answers

I don't think the problem is with # (it works fine for me) but with other URL characters that the character class is missing, such as -, /, and :.

How about a regex like this:

<a href=("[^"]+"|'[^']+'|[^ >]+)

Note: If possible, use a DOM parser instead of a regex when the HTML is valid.
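
A minimal sketch of how the pattern above could be used from C#, assuming `System.Text.RegularExpressions` is available in your target. The class name `AnchorHrefDemo` and the helper `ExtractHref` are hypothetical names chosen for illustration, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorHrefDemo
{
    // The pattern from the answer above: the href value is either
    // double-quoted, single-quoted, or unquoted (ends at a space or '>').
    private static readonly Regex HrefPattern =
        new Regex("<a href=(\"[^\"]+\"|'[^']+'|[^ >]+)");

    // Hypothetical helper: returns the href value of the first matching
    // anchor tag with surrounding quotes stripped, or null if none matches.
    public static string ExtractHref(string html)
    {
        Match m = HrefPattern.Match(html);
        return m.Success ? m.Groups[1].Value.Trim('"', '\'') : null;
    }
}
```

Because the quoted alternatives accept any character except the closing quote, values containing #, /, : or - are captured without listing each character explicitly.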

Czechnology
  • Thanks. I want to use an HTML parser and not a regex, but this is for a client-side Silverlight application, so I won't have access to those assemblies. I think I'll develop a web service that will do the parsing remotely for me, to have access to the full .NET platform (and use a DOM parser). – scripni Mar 23 '11 at 11:43
<a href=(('|")[^\2]+?\2|[^>]+)
Gursel Koca
  • This won't work well if the URL is enclosed in `'` `'`, or if the URL is not enclosed in quotes at all (not correct XHTML, but it seems the OP is trying to match such links too). – Czechnology Mar 23 '11 at 10:34
  • You should also have a space in the negated list (for the case with no quotes and more attributes). The problem with this regex is that if the url contains the other quote or `>` (non-escaped) it's going to end prematurely. That's why I used that ugly-looking list-like regex. – Czechnology Mar 23 '11 at 11:17
  • Your solution also suffers from the same problem. – Gursel Koca Mar 23 '11 at 11:28
  • It does? Maybe I've overlooked something, but I can't see it. Could you please give me a valid example where it fails? – Czechnology Mar 23 '11 at 11:54
  • But that's not a valid html tag. This would be parsed wrong even in the browser. What I meant before were links like ``. – Czechnology Mar 23 '11 at 13:38
  • I believe this one does not have any flaws. :) But it does not look as nice as the previous one. :) I just read www.regular-expressions.info/tutorial.html last week. – Gursel Koca Mar 23 '11 at 14:08

If you just want to replace the anchor part, use string operations. They are simpler and faster:

var parts = "http://someurl.com#hashpart".Split('#');
// yields "http://someurl.com" and "hashpart" as an array.
// you may want to check that the result has a length of two
// if it does:
var newUrl = string.Format("{0}#{1}", parts[0], "some replacement for hashpart");

If your URL contains multiple hashes, try using string.Substring to split at the first hash.

var url = "http://someurl.com#hash#hashhash";
var hashPos = url.IndexOf("#");
// everything before the first hash
var urlPart = url.Substring(0, hashPos);
// everything after the first hash
var hashPart = url.Substring(hashPos + 1);

Should work, wrote it without verification, maybe you have to toss around some +/- 1 to get the right positions.
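
The two snippets above could be folded into one small helper, sketched here under the assumption that a URL without any hash should simply get the new fragment appended. The names `FragmentDemo` and `ReplaceFragment` are hypothetical, chosen for illustration.

```csharp
using System;

public static class FragmentDemo
{
    // Hypothetical helper: replaces everything after the first '#'
    // with newFragment. If the URL contains no '#', the fragment is
    // appended instead (an assumption; adjust if you need other behavior).
    public static string ReplaceFragment(string url, string newFragment)
    {
        int hashPos = url.IndexOf('#');
        string basePart = hashPos >= 0 ? url.Substring(0, hashPos) : url;
        return basePart + "#" + newFragment;
    }
}
```

Checking `hashPos >= 0` before calling `Substring` avoids the exception that a negative index would otherwise cause when the URL has no hash at all.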

Zebi