The site I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm. It lists movies, and for each movie I want to collect the information in its table (rank, weekend gross, etc.), excluding the dates.
I am having trouble because the text I need has no links or class attributes to select on. I have tried several approaches, but none of them work.
This is one method I have so far, which only tries to get the ranks for each movie. I want the output to be a list of lists holding each movie's ranks, then another list of lists holding each movie's weekend grosses, and so on:
    listOfRanks = [[1, 1, 1], [1, 2, 3], [3, 5, 1]]  # etc.
    listOfWeekendGross = [[208806270, 106588440, 54200000], [111111111, 222222222, 333333333]]
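To make the target shape concrete, here is a minimal sketch of how scraped cell strings could be turned into those lists. It does not touch the site: `sample_rows`, `to_int`, and `split_columns` are hypothetical names, and the values are simply the ones from my example above written as display strings.

```python
def to_int(text):
    """Turn a display string such as '$208,806,270' into the integer 208806270."""
    return int(text.replace("$", "").replace(",", ""))


def split_columns(rows):
    """Split (rank, gross) string pairs into a list of ranks and a list of grosses."""
    ranks = [to_int(rank) for rank, _ in rows]
    grosses = [to_int(gross) for _, gross in rows]
    return ranks, grosses


# Hypothetical rows for ONE movie: (rank, weekend gross) for each weekend.
sample_rows = [
    ("1", "$208,806,270"),
    ("1", "$106,588,440"),
    ("1", "$54,200,000"),
]

ranks, grosses = split_columns(sample_rows)

# One inner list per movie; more movies would be appended the same way.
listOfRanks = [ranks]
listOfWeekendGross = [grosses]

print(listOfRanks)
print(listOfWeekendGross)
```

Stripping the `$` and the thousands separators before calling `int` matters here, because a gross pasted as `208,806,270` inside a Python list becomes three separate numbers.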
    import requests
    from bs4 import BeautifulSoup

    def getRank(item_url):
        # Insert "page=weekend&" right after "http://www.boxofficemojo.com/movies/?" (37 characters)
        href = item_url[:37] + "page=weekend&" + item_url[37:]
        response = requests.get(href)
        soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
        rank = soup.select('tbody > tr > td > center > table > tbody > tr > td > font')
        print(rank)
This is where I call the function:
    def spider(max_pages):
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(max_pages) + '&view=releasedate&view2=domestic&yr=2015&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "lxml")
        # The attribute value is quoted so the selector is valid CSS
        for link in soup.select('td > b > font > a[href^="/movies/?"]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            getRank(href)
The problem is that getRank(href) is not collecting the ranks correctly. I think the problem is with this line:
    rank = soup.select('tbody > tr > td > center > table > tbody > tr > td > font')
This is probably not the right way to get this text.
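One possible cause, sketched on made-up markup rather than the real page: browsers add `<tbody>` to the tree they display, but the raw HTML often has none, and as far as I know `html.parser` and `lxml` do not insert it (`html5lib` does). A selector path copied from the browser inspector can therefore match nothing. `SAMPLE_HTML` below is an assumption about what such a table might look like, not the site's actual structure, and `html.parser` is used only to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up markup: like a lot of raw HTML, it contains no <tbody> tag.
SAMPLE_HTML = """
<table>
  <tr><td><font>Rank</font></td><td><font>Weekend Gross</font></td></tr>
  <tr><td><font>1</font></td><td><font>$208,806,270</font></td></tr>
  <tr><td><font>1</font></td><td><font>$106,588,440</font></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A child-combinator path that goes through tbody finds nothing in this tree.
with_tbody = soup.select("table > tbody > tr > td > font")

# Walking rows and cells directly does not depend on tbody being present.
rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(len(with_tbody))
print(rows)
```

If this is what is happening, dropping `tbody` from the selector or iterating over `tr` and `td` elements as above might be worth trying on the real page.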
How can I get all the ranks, weekend gross, etc. from this site?