How to extract a tag link using regular expression (REGEX - C#)

Question

This is what I have so far:

<a href="(http://www.imdb.com/title/tt\d{7}/)".*?>.*?</a>

c#

ArrayList imdbUrls = matchAll(@"<a href=""(http://www.imdb.com/title/tt\d{7}/)"".*?>.*?</a>", html);
private ArrayList matchAll(string regex, string html, int i = 0)
{
  ArrayList list = new ArrayList();
  foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
    list.Add(m.Groups[i].Value.Trim());
  return list;
}

I'm trying to extract an IMDb link from an HTML page. What is wrong with this regular expression?

The main idea is to search Google for a movie and then look for a link to IMDb in the results.

I have no idea about `c#` but the inner `""` looks funny there to me... — arkascha, Nov 13 '12 at 12:38
Then why are the enclosing (outer) `"` not escaped the same way? Those are meant to be regex delimiters? — arkascha, Nov 13 '12 at 12:40
Maybe the slashes (`/`) have to be escaped. I suggest you try a regex validator. Good ones explain details of the matching process. — arkascha, Nov 13 '12 at 12:42
Please show the line of C# code that uses this regex. It is likely a character escaping issue. You may have to double backslashes for them to get passed to the regex engine. Also note that `.` matches any character, so you need to escape it to match a literal period. — dan1111, Nov 13 '12 at 12:49
You can try HAP (http://htmlagilitypack.codeplex.com) I suppose — ChruS, Nov 13 '12 at 12:52
Use an HTML parser, not regex. http://stackoverflow.com/a/1732454/399649 — Justin Morgan - On strike, Nov 13 '12 at 12:53

Anirudha · Accepted Answer · 2015-02-15T02:04:29.970

1

Regex is not a good choice for parsing HTML files. HTML is neither strict nor regular in its format.

Use HtmlAgilityPack. You can retrieve the links with code like this:

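// Needs: using System.Collections.Generic; using System.Linq;
// using System.Text.RegularExpressions; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every anchor tag that has an href attribute.
// SelectNodes can return null when nothing matches, so guard before using LINQ.
HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
List<string> anchorImdbList = anchors == null
    ? new List<string>()
    : anchors.Select(a => a.Attributes["href"].Value)
             .Where(href => Regex.IsMatch(href, @"www\.imdb\.com"))
             .ToList();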

edited Feb 15 '15 at 02:04

answered Nov 13 '12 at 13:03


It doesn't work... Value cannot be null. Parameter name: source. I'm trying to parse: http://www.google.com/search?q=imdb+The Awakening 2011 Une – Alex K Nov 13 '12 at 14:17
@AlexKapustian this may be because some anchor tags have no `href`. Try the edit. – Anirudha Nov 13 '12 at 14:53

Ali Vojdanian · Answer 2 · 2012-11-13T12:58:38.097

0

Try this:

string tag = "tag of the link";
string emptystring = Regex.Replace(tag, "<.*?>", string.Empty);

Update:

string emptystring = Regex.Replace(tag, @"<[^>]*>", string.Empty);
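
To see what this replacement actually does, here is a minimal sketch. The `StripTags` helper and the sample anchor are made up for illustration and are not part of the answer. Under those assumptions, the call removes the tags and keeps only the inner text, so the `href` value is discarded rather than extracted.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripTagsDemo
{
    // Hypothetical helper wrapping the Regex.Replace call shown above.
    public static string StripTags(string markup)
    {
        // Removes everything between '<' and the next '>' inclusive.
        return Regex.Replace(markup, @"<[^>]*>", string.Empty);
    }

    static void Main()
    {
        // Made-up sample input for illustration only.
        string tag = "<a href=\"http://www.imdb.com/title/tt1687901/\">The Awakening</a>";
        Console.WriteLine(StripTags(tag)); // prints "The Awakening"
    }
}
```

The URL lives inside the opening tag, so it is deleted together with the tag.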

edited Nov 13 '12 at 12:58

answered Nov 13 '12 at 12:50


I think this won't work because I need to extract the link from a page that has many tags like this <> – Alex K Nov 13 '12 at 12:57

score 0 · Answer 3 · answered Nov 13 '12 at 12:57

0

You must escape the forward slashes. Try:

<a href="(http:\/\/www.imdb.com\/title\/tt\d{7}\/)".*?>.*?<\/a>

If you need to parse HTML elements out of a complex page, regexes will be very cumbersome. Try the Html Agility Pack as others have suggested.
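
One caveat worth checking: .NET regex patterns are plain strings with no `/` delimiters, so escaping the forward slashes should be harmless rather than required. The sketch below is an illustration, not the asker's code. `ExtractImdbUrls` is a hypothetical helper and the sample HTML is invented. It assumes the link appears exactly as `href="http://www.imdb.com/title/tt…/"`, escapes the dots as suggested in the comments, and reads capture group 1 (group 0 is the whole match, which is what `matchAll` returns when called with its default `i = 0`).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImdbLinkDemo
{
    // Hypothetical helper: returns capture group 1 of every match.
    public static List<string> ExtractImdbUrls(string html)
    {
        // Verbatim string: "" is a literal quote, / needs no escape, \. is a literal dot.
        const string pattern = @"<a href=""(http://www\.imdb\.com/title/tt\d{7}/)""[^>]*>";
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
            urls.Add(m.Groups[1].Value); // Groups[0] would be the whole opening tag
        return urls;
    }

    static void Main()
    {
        // Made-up sample input for illustration only.
        string html = "<p><a href=\"http://www.imdb.com/title/tt1687901/\" class=\"l\">The Awakening</a></p>";
        foreach (string url in ExtractImdbUrls(html))
            Console.WriteLine(url);

        // An escaped slash and a plain slash match the same text in .NET.
        Console.WriteLine(Regex.IsMatch("a/b", @"a\/b") == Regex.IsMatch("a/b", "a/b"));
    }
}
```

If the page wraps or rewrites its links, the pattern will not match, which is one more reason to prefer an HTML parser for the general case.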

answered Nov 13 '12 at 12:57

PHeiberg

