I'm working on a movie scraper / auto-downloader that iterates over my current movie collection, finds new recommendations, and downloads the new goods.

There is a part where I scrape IMDb for metadata, and it gets stuck in one spot and I can't figure out why. The same code has run fine against other IMDb pages (this is the 29th page it has processed).

I am using C#.

The code:

    private string Match(string regex, string html, int i = 1)
    {
        return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
    }

Contents of the regex parameter string:

 <title>.*?\\(.*?(\\d{4}).*?\\).*?</title>

Contents of the html parameter string: too big to paste here, but it is literally the HTML source of http://www.imdb.com/title/tt4422748/combined

In Chrome, you can view it easily with:

view-source:http://www.imdb.com/title/tt4422748/combined

I have paused execution in Visual Studio and stepped forward; it continues to run but just hangs (it doesn't let me step, it just runs). If I hit pause again, it returns to the same spot with the same parameter values (and no, I am not calling it in an infinite loop). I'm pretty new to regex, so any help would be appreciated!

user3689167
  • Tempted to close as a duplicate of http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ ... Please make sure to read the first 20+ answers so you can write a plausible HTML parser with regex... Or just use HtmlAgilityPack and finish all the parsing in 10 minutes :) – Alexei Levenkov Apr 23 '15 at 01:27
  • Got a good laugh out of that "duplicate"... I was using a free parser I found on the net (which uses regex), so I was hoping I could make it work instead of rewriting everything from scratch. I will look into HtmlAgilityPack! Thanks – user3689167 Apr 23 '15 at 01:30
  • What is the title it gets stuck on? And still, you should be using HtmlAgilityPack to get to the title, regardless of whether you use regex to parse it further – Sten Petrov Apr 23 '15 at 01:36

1 Answer

Using .* is like saying "I want to match everything, yet nothing". Whenever the rest of the pattern fails to match, the engine has to backtrack through every way of dividing the text among those wildcards. With several of them in one pattern and a large input, there are so many possibilities that the program becomes unresponsive and appears to lock up.

Does the person designing the pattern really not know whether the title will contain text? I bet the title has text 99% of the time, so why use .* at all? Use .+ at the very least.

If you want the text between the delimiters, use this:
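One way to confirm that backtracking is what makes the call hang is to give the regex a match timeout, so a runaway match throws instead of running forever. This is only a sketch: the `SafeRegex` class and `MatchOrNull` name are made up for illustration, and it assumes .NET 4.5 or later, where the `Regex` constructor accepts a `TimeSpan` timeout.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeRegex
{
    // Hypothetical helper: same idea as the Match method in the question,
    // but gives up after `timeout` instead of hanging.
    // Returns null on timeout and an empty string when nothing matches.
    public static string MatchOrNull(string pattern, string input, TimeSpan timeout, int group = 1)
    {
        try
        {
            var regex = new Regex(pattern, RegexOptions.None, timeout);
            Match m = regex.Match(input);
            return m.Success ? m.Groups[group].Value.Trim() : string.Empty;
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine spent longer than `timeout` on this input,
            // which is the usual symptom of excessive backtracking.
            return null;
        }
    }
}
```

A null result for the page that hangs would point at the pattern rather than at the surrounding loop.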

title\>(?<Title>[^<]+)\</title

Then extract the matched text through the named group, Groups["Title"], instead of Groups[0] (which is the whole match, including the tag text). Groups[1] holds the same captured text, if one loathes named match captures.
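Here is a rough sketch of how that could look in C#. The `TitleExtractor` class and its method names are invented for the example, the backslashes before `<` and `>` are dropped because .NET does not need them, and the year pattern is my own guess at what the original regex was after (a four-digit number inside parentheses). Since `[^<]` and `[^)]` cannot run past the next delimiter, there is very little for the engine to backtrack over.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // The pattern from above: everything between the title tags that is not a '<'.
    private static readonly Regex TitleRegex =
        new Regex(@"<title>(?<Title>[^<]+)</title>", RegexOptions.IgnoreCase);

    // Assumed second step: a four-digit year somewhere inside one pair of parentheses.
    private static readonly Regex YearRegex =
        new Regex(@"\([^)]*?(?<Year>\d{4})[^)]*\)");

    // Returns the trimmed title text, or an empty string if there is no title tag.
    public static string ExtractTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        return m.Success ? m.Groups["Title"].Value.Trim() : string.Empty;
    }

    // Runs the year pattern on the short title text only, never on the whole page.
    public static string ExtractYear(string html)
    {
        Match m = YearRegex.Match(ExtractTitle(html));
        return m.Success ? m.Groups["Year"].Value : string.Empty;
    }
}
```

Splitting the work in two steps means the second pattern only ever sees a few dozen characters, whatever the size of the page.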

Answer for Regex Haters

Use the HTML Agility Pack.
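For the parser route, a minimal sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced in the project; the `AgilityTitle` class and `GetTitle` name are placeholders of mine, not part of the library.

```csharp
using HtmlAgilityPack;

public static class AgilityTitle
{
    // Hypothetical helper: returns the decoded text of the first <title> element,
    // or an empty string if the document has none.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath lookup; SelectSingleNode returns null when nothing matches.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
        return node == null
            ? string.Empty
            : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

Any further parsing of the title, such as pulling out the year, can then run on that short string instead of the whole page.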

ΩmegaMan