
I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am running into a logical error that I cannot figure out. Essentially, I want to take the top 100 movies and write the data to a CSV file.

I am currently using HTML from this page for testing (other years have the same layout): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm

There's a lot of code, but this is the main part I am struggling with:

def grab_yearly_data(self,page,year):
    # page is the url that was downloaded, year in this case is 2014.

    rank_pattern=r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern=r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern=r'.htm">*?</a></font></b></td>' # Testing

    self.rank= [g for g in re.findall(rank_pattern,page)]
    self.mov_title=[g for g in re.findall(mov_title_pattern,page)]

self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list of 102 elements containing the movie titles, but instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong; I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.


3 Answers


Just don't attempt to parse HTML with regex - it will save you time and, most importantly, hair, and it will make your life easier.

Here is a solution using BeautifulSoup HTML parser:

from bs4 import BeautifulSoup
import requests

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:
        continue

    rank = cells[0].text
    title = cells[1].text
    print(rank, title)

Prints:

1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below

The expression inside the select() call is a CSS selector - a convenient and powerful way of locating elements. But, since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice is there to eliminate the header and total rows.


For this page, to get to the table you can rely on the chart_container element and get its next table sibling:

for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
    ...
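A runnable sketch of that sibling-navigation idea, using a tiny invented HTML fragment in place of the real page (the real chart table has more columns and footer rows, hence the [1:-3] slice above):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the page layout: the chart div
# followed by the borderless data table.
html = """
<div id="chart_container">chart</div>
<table border="0">
  <tr><td>Rank</td><td>Title</td></tr>
  <tr><td>1</td><td>Guardians of the Galaxy</td></tr>
  <tr><td>2</td><td>The LEGO Movie</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_next_sibling() skips the whitespace text nodes between the
# div and the table and lands directly on the table tag.
table = soup.find("div", id="chart_container").find_next_sibling("table")
for row in table.find_all("tr")[1:]:  # [1:] skips the header row
    cells = row.find_all("td")
    print(cells[0].text, cells[1].text)
```

On the real page you would slice with [1:-3] instead of [1:] to drop the footer/total rows as well.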
alecxe
  • I tried something similar with daily data from the top 100 movies (http://www.boxofficemojo.com/movies/?page=daily&view=chart&id=hungergames3.htm). For some reason, every week the columns shift by 1. Any advice on what's wrong with this statement? alltables=soup.findAll("table", {"border":"0", "width":"95%"}) – user3667623 Jan 06 '15 at 06:29
  • 1
    @user3667623 I've updated the answer providing an another option to get to the table with data - check it out. Hope that helps. – alecxe Jan 06 '15 at 13:18
mov_title_pattern=r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'

Try this. It should work for your case. See the demo:

https://www.regex101.com/r/fG5pZ8/6
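As a quick check, here is that pattern run against a hand-written fragment shaped like one of the page's title cells (the href value is invented for illustration):

```python
import re

# Hand-written cell fragment in the shape the pattern expects;
# the href value is invented.
fragment = ('<td><b><font size="2">'
            '<a href="/movies/?id=guardians.htm">Guardians of the Galaxy'
            '</a></font></b></td>')

pattern = r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'
print(re.findall(pattern, fragment))  # ['Guardians of the Galaxy']
```

Note that the character class only allows letters, digits and spaces, so titles containing punctuation (e.g. "As Above/So Below" or "The Hunger Games: Mockingjay - Part 1") would not match; you would need to widen the class for those.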

vks

Your regex does not make much sense. It matches .htm">[A-Z] as few times as possible, which is usually zero, yielding an empty string.

Moreover, with a very general regular expression like that, there is no guarantee that it only matches on the result rows. The generated page contains a lot of other places where you could expect to find .htm"> followed by something.
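A minimal demonstration of that zero-repetition behavior (the fragment is made up, but ends with the same tag sequence as the page):

```python
import re

# The lazy *? happily repeats the group zero times, so the capture
# group never participates in the match and findall reports it as
# an empty string.
sample = '<a href="/movies/?id=legomovie.htm">The LEGO Movie</a></font></b></td>'
pattern = r'(.htm">[A-Z])*?</a></font></b></td>'
print(re.findall(pattern, sample))  # ['']
```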

More generally, I would advocate an approach where you craft a regular expression which precisely identifies each generated result row, and extracts from that all the values you want. In other words, try something like

re.findall('stuff (rank) stuff (title) stuff stuff stuff', page)

(where I have left it as an exercise to devise a precise regular expression with proper HTML fragments where I have the stuff placeholders) and extract both the "rank" group and the "title" group out of each matched row.

Granted, scraping is always brittle business. If you make your regex really tight, chances are it will stop working if the site changes some details in its layout. If you make it too relaxed, it will sometimes return the wrong things.

tripleee