
How can I replace

<a href="page">Text</a>

with

<a href="page.html">Text</a>

where page and Text can be any set of characters?

Justin808

2 Answers


You shouldn't parse HTML with regular expressions. See the answer to this question for details.

Update: As TrueWill has pointed out, you may want to do the replacement with Html Agility Pack. In some special cases, though, the regex proposed by FailedDev will do, although I would modify it slightly to look like this: @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)" (a \b after the <a excludes other tags whose names start with "a").
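For the parser route, here is a minimal sketch using Html Agility Pack, assuming the NuGet package is referenced. The class name HrefRewriterHap, the method name AppendHtml, and the rule of skipping values that already end in .html are assumptions for illustration, not part of the question. Note that the library may normalize some of the markup when it writes the document back out.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

public static class HrefRewriterHap // hypothetical name
{
    public static string AppendHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <a> element that has an href attribute.
        // SelectNodes returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode a in anchors)
            {
                string href = a.GetAttributeValue("href", "");

                // Assumption: leave empty values and values that
                // already end in .html untouched.
                if (href.Length > 0 &&
                    !href.EndsWith(".html", StringComparison.OrdinalIgnoreCase))
                {
                    a.SetAttributeValue("href", href + ".html");
                }
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(AppendHtml("<a href=\"page\">Text</a>"));
    }
}
```

Because only a elements are selected, text that merely looks like a link inside a comment is not touched, which is the concern raised in the comments below.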

Gebb
  • I'm not trying to parse the HTML, I'm trying to do a string replace in an HTML file. – Justin808 Nov 04 '11 at 17:43
  • One simple regex would be `(.*?)` to find the parts. – jCoder Nov 04 '11 at 17:45
  • @Justin808 But to do it correctly, you actually need to parse the document. For example, you will probably want to ignore scripts and comments. – Gebb Nov 04 '11 at 17:46
  • @Gebb is correct. Any changes to HTML, particularly those affecting only a specific context (such as in an HREF), involve parsing. Take a look at http://htmlagilitypack.codeplex.com/ – TrueWill Nov 04 '11 at 17:46

This will work for your example. The pattern matches only the value inside href, and the replacement appends .html to it:

resultString = Regex.Replace(subjectString, @"(?<=<a[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)", "$2.html");

You may need to adapt it to your needs.

Edit: before the flame wars begin: yes, this works for your specific example, not for all possible HTML on the internet.
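For anyone who wants to try it, here is a small self-contained sketch of the replacement. Two tweaks are assumptions rather than part of the original pattern: the \b after <a suggested in the other answer, and a lazy (.*?) in place of (.*), because the greedy form can span from the first href to the last one when two links sit on the same line. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefRewriter // hypothetical name
{
    // Group 1 is the opening quote (captured inside the lookbehind),
    // group 2 is the href value. The lookahead requires the same quote
    // character to close the value before the tag ends.
    private static readonly Regex HrefValue = new Regex(
        @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*?)(?=\1.*?>)");

    public static string AppendHtml(string html)
    {
        // Only the href value is matched, so only it is replaced.
        return HrefValue.Replace(html, "$2.html");
    }

    public static void Main()
    {
        // Prints: <a href="page.html">Text</a>
        Console.WriteLine(AppendHtml("<a href=\"page\">Text</a>"));

        // Two links on one line, single-quoted.
        // Prints: <a href='a.html'>A</a> <a href='b.html'>B</a>
        Console.WriteLine(AppendHtml("<a href='a'>A</a> <a href='b'>B</a>"));
    }
}
```

The same caveat applies: this relies on .NET's variable-length lookbehind, and it will also rewrite anything that merely looks like a link, for example inside a script or a comment.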

FailedDev