
I need to extract a link from a string using a regular expression in C#. I cannot use a substring method, since both the letters in the string and the link may vary. This is the link with its surrounding characters:

-sv"><a href="http://sv.wikipedia.org/wiki/%C3%84pple" title="

The -sv"><a href=" part must be included in the regex, or it won't be specific enough. The match can end at the quotation mark at the end of the link, or wherever is easiest. I've had another suggestion as well; however, it does not include the sv part at the beginning, and the submitter couldn't make it compile:

@"]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?[^>]*?>";

Now I'm turning to you guys on Stack Overflow. Thanks in advance! Max

user655071
  • Could you give your question a little more context? I suspect there may be other approaches to what you are *really* trying to accomplish. – Simen S Mar 11 '11 at 09:23
  • I've built the program around a regex that picks the link between sv"> – user655071 Mar 11 '11 at 09:42

3 Answers


Try using an HTML parser. Its source code is also quite intuitive to learn from.

Download the library and add a reference to HtmlAgilityPack.dll. Then get all your links with:

    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    List<string> listOfUrls = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(@"c:\ht.html");
    // SelectNodes returns null when nothing matches the XPath expression.
    HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//li[@class='interwiki-sv']");
    if (coll != null)
    {
        foreach (HtmlNode li in coll)
        {
            // The link is expected to be the first child of the <li> element.
            HtmlNode node = li.ChildNodes.FirstOrDefault();
            if (node == null) continue;
            HtmlAttribute att = node.Attributes["href"];
            if (att == null) continue;
            listOfUrls.Add(att.Value);
        }
    }
    // listOfUrls now holds the links to process.
Typist
  • How do I install this? I can't find any instructions. Also, are you certain that I can use the condition starting with -sv" with this tool? – user655071 Mar 11 '11 at 10:03
  • @user: it's a library which you can reference in your code. You need to play with it in your case to get the result that you want. – Typist Mar 11 '11 at 11:35
  • Do you have an idea of how I should use it? All I can find is how to handle HTML files and how to extract all links from a site. – user655071 Mar 11 '11 at 11:58

Check question: Regex to Parse Hyperlinks and Descriptions

pirho
  • I'm sorry, I'm a complete newbie. I need everything typed out and ready to use or I don't know what to do x) – user655071 Mar 11 '11 at 10:01

Parsing stuff out of html with regex is fraught with danger. Please see this classic answer which explains this with force and humour.

The problem with your question is that we don't know the context.

  • Are you sure the same substring won't appear twice?
  • Are you sure there won't be extra whitespace?
  • Are you sure the html will be valid? (i.e., they could forget to use "", or use '' instead)
  • Are you sure they won't put the title before the href?

There are lots of ways to get it wrong...
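To make those failure modes concrete, here is a small sketch, assuming a strict pattern written literally for the asker's snippet. The class name `FragilityDemo` and the method `Matches` are made up for illustration. The pattern matches the exact snippet, but stops matching as soon as the quoting, spacing or attribute order changes:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FragilityDemo
{
    // Strict pattern: requires exactly -sv"><a href="..." with double quotes
    // and no extra whitespace. In a verbatim string, "" is an escaped quote.
    private static readonly Regex Strict = new Regex(@"-sv""><a href=""([^""]+)""");

    // True only when the input contains the marker in exactly that form.
    public static bool Matches(string html)
    {
        return Strict.IsMatch(html);
    }
}
```

With this sketch, `Matches` returns true for the snippet in the question, and false for the same link written with single quotes, with a space between `>` and `<a`, or with `title` placed before `href`.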


However, to answer your question, this regex pattern will work for the exact string you have pasted:

 -sv"><a href="([^"]+)"

However, you won't be able to do a replace directly with that. Note the (), this is a regex capture. I'd recommend looking that up yourself, that way you won't be a newbie forever :)
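As a minimal sketch of how the capture could be read in C#, assuming the page source is already in a string: the names `LinkExtractor` and `ExtractSvLink` are hypothetical, and the pattern is the one above written as a verbatim string, where `""` stands for a single quotation mark.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // -sv"><a href="([^"]+)"  written as a C# verbatim string literal.
    private static readonly Regex SvLink = new Regex(@"-sv""><a href=""([^""]+)""");

    // Returns the text captured by the first group (the URL),
    // or null when the marker is not present in the input.
    public static string ExtractSvLink(string html)
    {
        Match m = SvLink.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

`Groups[0]` is the whole match including the `-sv"><a href="` prefix, while `Groups[1]` is only what the parentheses captured. Calling `LinkExtractor.ExtractSvLink` on the snippet from the question should return `http://sv.wikipedia.org/wiki/%C3%84pple`.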

Benjol
  • I can't fit the whole string since it is the source code of a wikipedia page. The same substring will never appear twice. There will be no more whitespace. It will be valid. No there is no title before the href. The string is the source code of this page: http://en.wikipedia.org/wiki/Apple – user655071 Mar 11 '11 at 10:18
  • @user, ok. However, my answer still holds. Sorry. – Benjol Mar 11 '11 at 10:21
  • okay, I will try to use a parser instead. Thank you for your help! – user655071 Mar 11 '11 at 10:24