I am stuck at a simple problem.

I am using RegEx to extract URLs from HTML markup. I want to add the constant prefix

"The site is"

to the extracted RegEx group.

Sample markup:

<html>
  <body>
    <a href="www.stackoverflow.com"></a>
  </body>
</html>

and the expression I am using is:

<a\shref="(?<Url>.*?)"></a>

Currently I am getting the group Url as

www.stackoverflow.com

but I want that as

The site is www.stackoverflow.com

How can I get it?

mmdemirbas
Agent007
  • Can't you simply concatenate "The site is " with group result value? – Marco Aug 23 '12 at 07:13
  • @Marco I thought that would be great if I can get it done in RegEx itself. – Agent007 Aug 23 '12 at 07:14
  • IMHO Regex should be used to extract values from a complex string using rules: presentation of the result is made after... – Marco Aug 23 '12 at 07:15
  • Similar to [Regex: Named Capturing Groups in .NET](http://stackoverflow.com/a/906847/471214) – mmdemirbas Aug 23 '12 at 07:56
  • @mmdemirbas My question is not at all related to how to extract hrefs from anchors, I want to prefix some string to extracted RegEx group, not from code but in RegEx itself, hope you get it ! :) – Agent007 Aug 23 '12 at 08:17
  • Yes, I meant this: `link = regex.Match(input).Result("The site is ${Url}");` – mmdemirbas Aug 23 '12 at 08:23

2 Answers

Regex  regex  = new Regex(@"<a\shref=""(?<Url>.*?)""></a>");
String input  = ... // your sample markup
String result = regex.Match(input).Result("The site is ${Url}");
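
For reference, here is a minimal self-contained sketch of the same idea. The class name `SitePrefixer` and the method `Describe` are hypothetical names made up for this illustration. It checks `Match.Success` first, because `Match.Result` throws a `NotSupportedException` when called on a failed match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SitePrefixer
{
    // Same pattern as in the question; the named group "Url" captures the href value.
    private static readonly Regex AnchorPattern =
        new Regex(@"<a\shref=""(?<Url>.*?)""></a>");

    // Returns the prefixed URL, or null when the markup contains no matching anchor.
    public static string Describe(string html)
    {
        Match match = AnchorPattern.Match(html);

        // ${Url} in the replacement pattern expands to the named group's value.
        return match.Success ? match.Result("The site is ${Url}") : null;
    }
}
```

With the sample markup from the question, `SitePrefixer.Describe(input)` should yield `The site is www.stackoverflow.com`. The prefix lives in the replacement pattern rather than in the regular expression itself, so the value of the `Url` group is still just the URL.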
mmdemirbas
  • The same I want to achieve with RegEx, I mean the RegEx match itself should give the result (what result in your snippet will contain) and not the C#. – Agent007 Aug 23 '12 at 08:33
  • As I know, this is impossible. Do you have any strong reason to want so? I am really wondering. – mmdemirbas Aug 23 '12 at 08:42
  • Actually, I am writing a small crawler. Some websites give absolute anchors with the domain name included (e.g. www.xyz.com/p/q.aspx) and some give relative anchors (e.g. /p/q.aspx). For the relative ones, I want to obtain absolute hyperlinks (i.e. with the domain name included). :) – Agent007 Aug 23 '12 at 08:55
  • Ok, you should use C# with regex. Bare regex will disappoint you at this point. – mmdemirbas Aug 23 '12 at 09:00
  • actually, I was thinking C# as the last resort, thanks anyway ! – Agent007 Aug 23 '12 at 09:02
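
For the crawler scenario described in the comments above, one possible approach (not proposed by anyone in the thread) is to let `System.Uri` resolve relative hrefs against the address of the page they came from, instead of prepending a fixed string. A minimal sketch follows; `LinkResolver`, `ToAbsolute`, and the example.com addresses are hypothetical names used only for illustration.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class LinkResolver
{
    // pageUrl: absolute address of the page that contained the link.
    // href:    the extracted href value, relative or absolute.
    public static string ToAbsolute(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);

        // Uri(Uri, string) resolves a relative reference against the base;
        // an href that is already absolute is returned unchanged.
        var resolved = new Uri(baseUri, href);
        return resolved.AbsoluteUri;
    }
}
```

For example, `LinkResolver.ToAbsolute("http://www.example.com/a/b.aspx", "/p/q.aspx")` gives `http://www.example.com/p/q.aspx`. Note that an href without a scheme, such as `www.xyz.com/p/q.aspx`, is treated as a relative path, so links written that way would need separate handling.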

Brief answer: don't parse HTML using regex. In depth answer

EthanB
  • Even if your suggestion is correct, it doesn't solve OP question. Maybe this should be a comment... – Marco Aug 23 '12 at 07:17
  • The answer is "don't do it" -- the same as the correct answer for the linked question. – EthanB Aug 23 '12 at 07:22
  • No, the answer is not that, but it's my opinion. I'll try to explain what I mean. OP wants to extract a URL from a webpage and then wants to have "The site is " concatenated with that URL. It doesn't matter (here) how OP extracts that URL (Regex, HtmlAgilityPack or anything else), he has to join these two strings in some way. Even if OP decides to use HtmlAgilityPack (born exactly to accomplish these kinds of tasks) he has to concatenate strings after extraction. Do you agree? – Marco Aug 23 '12 at 07:28
  • Perhaps something along [these lines](http://stackoverflow.com/questions/122856/parse-html-links-using-c-sharp)? – EthanB Aug 23 '12 at 07:35
  • Using regex in this example is perfectly valid. He doesn't do any **parsing** (extracting doesn't count). He does **extract** text after ` – mmdemirbas Aug 23 '12 at 07:45
  • Actually, I would prefer to search for `href=` instead of ` – mmdemirbas Aug 23 '12 at 07:48
  • It is also a link technically :) – mmdemirbas Aug 23 '12 at 07:51