I want to extract the movie name from each row of IMDb's box office table.

Example HTML table row:

    <tr class="chart_even_row">
      <td style="text-align: right;">
        <b>1</b>
      </td>
      <td>
        <img border="0" src="http://ia.media-imdb.com/images/M/MV5BMjA4NDg3NzYxMF5BMl5BanBnXkFtZTcwNTgyNzkyNw@@._V1._SY30_SX23_.jpg" width="20" height="30">
      </td>
      <td>
        <a  href="/title/tt1392170/" >The Hunger Games</a> (2012)
      </td>
      <td style="text-align: right; padding-right: 20px;">$155M
      </td>
      <td style="text-align: right;">
        $155M
      </td>
      <td style="text-align: center;">
        1
      </td>
    </tr>

The value I want to extract is "The Hunger Games".

I need C# code that achieves this.

NOTE: I want to do this with a regex.

Thanks in advance, Rashad.

Rashad Ahmad

2 Answers

Screen scraping the IMDB is complicated, fragile, and forbidden. The IMDB provides plain-text data files you can use instead at http://www.imdb.com/interfaces

Update

Allow me to reiterate: screen scraping and data mining IMDB.com is in violation of their terms of use.

Regarding Regex: see this answer.

So if you're dead-set on doing this in violation of the IMDB's terms of use, the HTML Agility Pack is probably the best way to go.
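
For completeness, here is a minimal sketch of the regex route the question asks about. It assumes every title link has exactly the shape shown in the sample row (`<a href="/title/tt.../">Name</a>`, with no nested tags inside the link); any markup change breaks it, which is the fragility mentioned above. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class BoxOfficeRegexSketch
{
    // Assumption: links look like <a href="/title/tt1392170/">Name</a>,
    // possibly with extra whitespace inside the opening tag.
    private static readonly Regex TitleLink = new Regex(
        @"<a\s+href=""/title/tt\d+/""\s*>(?<name>[^<]+)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> ExtractTitles(string html)
    {
        var titles = new List<string>();
        foreach (Match m in TitleLink.Matches(html))
        {
            // Decode entities such as &amp; and trim stray whitespace.
            titles.Add(WebUtility.HtmlDecode(m.Groups["name"].Value.Trim()));
        }
        return titles;
    }

    public static void Main()
    {
        string row = @"<td>
<a  href=""/title/tt1392170/"" >The Hunger Games</a> (2012)
</td>";
        foreach (string title in ExtractTitles(row))
        {
            Console.WriteLine(title); // prints "The Hunger Games"
        }
    }
}
```

The named group `name` captures only the link text, so the trailing "(2012)" is left out.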

StriplingWarrior

Try copying the page's HTML into a single HTML file. If you have too many pages to fetch by hand, write code that reads them with the HTML Agility Pack.

You can find the HTML Agility Pack here: http://htmlagilitypack.codeplex.com/
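
A rough sketch of what that could look like, assuming the HtmlAgilityPack package is referenced and that title links sit directly inside a `<td>` with an `href` starting with `/title/`, as in the sample row from the question. The class and method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external package, assumed to be installed

// Hypothetical helper, not part of the library.
public static class BoxOfficeHapSketch
{
    public static List<string> ExtractTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();

        // Assumption: each movie link is a direct child of a table cell.
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//td/a[starts-with(@href, '/title/')]");

        // SelectNodes returns null when nothing matches.
        if (links == null)
        {
            return titles;
        }

        foreach (HtmlNode link in links)
        {
            titles.Add(HtmlEntity.DeEntitize(link.InnerText).Trim());
        }
        return titles;
    }

    public static void Main()
    {
        string row =
            "<tr><td><a  href=\"/title/tt1392170/\" >The Hunger Games</a> (2012)</td></tr>";
        foreach (string title in ExtractTitles(row))
        {
            Console.WriteLine(title);
        }
    }
}
```

For a page saved to disk, `doc.Load(path)` can replace `doc.LoadHtml(html)`.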

Chinook