0

This is what I have so far:

<a href="(http://www.imdb.com/title/tt\d{7}/)".*?>.*?</a>

C#:

ArrayList imdbUrls = matchAll(@"<a href=""(http://www.imdb.com/title/tt\d{7}/)"".*?>.*?</a>", html);
private ArrayList matchAll(string regex, string html, int i = 0)
{
  ArrayList list = new ArrayList();
  foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
    list.Add(m.Groups[i].Value.Trim());
  return list;
}

I'm trying to extract IMDb links from an HTML page. What is wrong with this regular expression?

The main idea is to search Google for a movie and then look for a link to IMDb in the results.

Alex K

3 Answers

1

Regex is not a good choice for parsing HTML files. HTML is neither strict nor regular in its format.

Use the Html Agility Pack instead. You can retrieve the links with code like this:

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every anchor tag that has an href attribute.
// Note: SelectNodes returns null (not an empty list) when nothing matches.
List<string> anchorImdbList = doc.DocumentNode.SelectNodes("//a[@href]")
                  .Select(p => p.Attributes["href"].Value)
                  .Where(x => Regex.IsMatch(x, @"www\.imdb\.com"))
                  .ToList();
Anirudha
  • It doesn't work: "Value cannot be null. Parameter name: source". I'm trying to parse: http://www.google.com/search?q=imdb+The Awakening 2011 Une – Alex K Nov 13 '12 at 14:17
  • @AlexKapustian this may be because some anchor tags have no `href`. Try the edit – Anirudha Nov 13 '12 at 14:53
0

Try this:

string tag = "tag of the link";
string emptystring = Regex.Replace(tag, "<.*?>", string.Empty);

Update:

string emptystring = Regex.Replace(tag, @"<[^>]*>", string.Empty);
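
To make the effect of the updated pattern concrete, here is a minimal sketch that wraps it in a helper. The class name `TagStripDemo` and the method name `StripTags` are hypothetical and used only for illustration; the pattern itself is the one from the update above.

```csharp
using System.Text.RegularExpressions;

static class TagStripDemo
{
    // Removes every run that starts at '<' and ends at the next '>'.
    // Only the text between tags survives; attributes such as href are dropped.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }
}
```

Calling `TagStripDemo.StripTags("<a href=\"http://www.imdb.com/title/tt1234567/\">Some Film</a>")` returns `Some Film`. The URL sits inside the tag, so it is removed along with it. This approach yields the link text, not the link target.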
Ali Vojdanian
  • I don't think this will work, because I need to extract the link from a page that has many tags like this: <> – Alex K Nov 13 '12 at 12:57
0

You must escape the forward slashes. Try:

<a href="(http:\/\/www.imdb.com\/title\/tt\d{7}\/)".*?>.*?<\/a>
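
As a rough sketch, the pattern above can be exercised in C# as follows. Two things here are assumptions rather than part of the answer: the dots are escaped as well (so that `.` matches only a literal dot), and the names `ImdbLinkDemo` and `ExtractImdbUrls` are hypothetical. The helper reads capture group 1, because group 0 (the default `i = 0` in the question's `matchAll`) is the whole `<a ...>...</a>` match, not just the URL.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImdbLinkDemo
{
    // Verbatim string: "" stands for one double quote inside the pattern.
    const string Pattern =
        @"<a href=""(http:\/\/www\.imdb\.com\/title\/tt\d{7}\/)"".*?>.*?<\/a>";

    // Returns the URL (capture group 1) of every anchor that matches.
    public static List<string> ExtractImdbUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
            urls.Add(m.Groups[1].Value);
        return urls;
    }
}
```

For an input such as `<a href="http://www.imdb.com/title/tt1234567/" class="l">Some Film</a>`, the helper returns a single entry, `http://www.imdb.com/title/tt1234567/`. It only matches anchors whose `href` starts with that exact prefix, so a link wrapped in any other URL form will not be found.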

If you need to parse HTML elements out of a complex page, regexes will be very cumbersome. Try the Html Agility Pack, as others have suggested.

PHeiberg