
The site I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm. This site has a list of movies, and for each movie, I want to get the following information in the table, excluding the Dates.

I am having trouble with this because the text doesn't have links or any class attributes to anchor on. I have already tried multiple methods, but none of them are working.

This is one method I have so far, just to get the ranks for each movie. I want the output to be a list of lists, with one inner list of ranks per movie, and then another list of lists for the weekend grosses, and so on:

listOfRanks = [[1, 1, 1], [1, 2, 3], [3, 5, 1]], etc.
listOfWeekendGross = [[208806270, 106588440, 54200000], [111111111, 222222222, 333333333]]
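
(For reference, the grosses appear on the site as formatted strings like "$208,806,270"; to store them as plain ints like above, the "$" and commas have to be stripped. A tiny hypothetical helper, written here in Python 3:)

```python
# Hypothetical helper: turn a formatted gross like "$208,806,270"
# into the plain int the lists above expect.
def parse_gross(text):
    return int(text.strip("$").replace(",", ""))

print(parse_gross("$208,806,270"))  # 208806270
```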


import requests
from bs4 import BeautifulSoup

def getRank(item_url):
    # Splice "page=weekend&" into the movie URL's query string
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    rank = soup.select('tbody > tr > td > center > table > tbody > tr > td > font')
    print rank

This is where I call this function -

def spider(max_pages):
    url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(max_pages) + '&view=releasedate&view2=domestic&yr=2015&p=.htm'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")  # name the parser explicitly
    for link in soup.select('td > b > font > a[href^="/movies/?"]'):
        href = 'http://www.boxofficemojo.com' + link.get('href')
        getRank(href)

The problem is that the getRank(href) method is not collecting the ranks correctly. I think the issue is with this line:

    rank = soup.select('tbody > tr > td > center > table > tbody > tr > td > font')

This is probably not the right way to get this text.

How can I get all the ranks, weekend gross, etc. from this site?


alphamonkey
2 Answers


Yep, the problem is in the selector you're using. You see, the markup on that website is pretty bad: the tables are not properly coded and actually lack the tbody tags, but Google Chrome adds them anyway, which is why you see them in the Web Developer Tools.

However, as I said, they're not in the actual HTML code, so there's no way BeautifulSoup will be able to match the rows if you use tbody in your selector. That table has the class chart-wide, so you can target the rows using:

rows = soup.select('.chart-wide tr')
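
(A quick way to convince yourself the tbody really isn't there: parse a made-up snippet shaped like the site's markup with the stdlib html.parser, which doesn't insert missing table elements.)

```python
from bs4 import BeautifulSoup

# A made-up table in the same shape as the site's: no <tbody> in the source.
html = '<table class="chart-wide"><tr><td>1</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

print(soup.select("tbody tr"))        # [] -- tbody never matches
print(soup.select(".chart-wide tr"))  # the row is found
```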

After that, you can iterate over those rows, skipping the first one (because that'd be the header) and parsing the other ones and their individual cells.

Something like this:

import requests
from bs4 import BeautifulSoup

def getRank(item_url):
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    print response.status_code, "for", href
    soup = BeautifulSoup(response.content, "lxml")  # lxml doesn't insert the missing <tbody>; html5lib would

    rows = soup.select('.chart-wide tr')

    headers = "Date Rank WeekendGross Change Theaters Change/Avg GrossToDate Week".split()

    for row in rows[1:]:  # skip the header row
        for header, child in zip(headers, row.children):
            print header, ":", child.text
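
(To get from printing to the listOfRanks / listOfWeekendGross structure the question asks for, here's one hedged sketch in Python 3, using a static two-row snippet as a stand-in for a real chart page; the cell positions are assumed from the header order above.)

```python
from bs4 import BeautifulSoup

# Stand-in for one movie's weekend table (real pages have more columns).
html = """
<table class="chart-wide">
  <tr><td>Date</td><td>Rank</td><td>WeekendGross</td></tr>
  <tr><td>Jun 12-14</td><td>1</td><td>$208,806,270</td></tr>
  <tr><td>Jun 19-21</td><td>2</td><td>$106,588,440</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.select(".chart-wide tr")[1:]  # skip the header row

ranks = [int(r.find_all("td")[1].text) for r in rows]
grosses = [int(r.find_all("td")[2].text.strip("$").replace(",", ""))
           for r in rows]
print(ranks)    # [1, 2]
print(grosses)  # [208806270, 106588440]
```

Appending these per-movie lists across movies gives the nested structure from the question.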
José Tomás Tocino
  • For some reason the "child.text" line isn't working. I've tried child.string, and child.getText() too. The specific error is UnicodeEncodeError: 'charmap' codec can't encode character u'\x96' in position 6: character maps to . If I just print the header part, it works – alphamonkey Jul 10 '15 at 01:08
  • Are you sure? This is the entire script I'm using and it works in my machine: https://ideone.com/Jt3OCh – José Tomás Tocino Jul 10 '15 at 01:09
  • I think there could be something wrong with the encoding file, based on the error File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 96, in getRank print header, ":", child.text File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode return codecs.charmap_encode(input,errors,encoding_table) UnicodeEncodeError: 'charmap' codec can't encode character u'\x96' in position 6: character maps to – alphamonkey Jul 10 '15 at 01:19
  • if I copy and paste your code exactly, I'm still getting this error. I've posted a picture of the "line 12 in encode return" above – alphamonkey Jul 10 '15 at 01:20
  • 1
    Looks like your local encoding is different from the website's. It has nothing to do with the actual scraping, but anyhow, try using the solution provided here to handle the encoding: http://stackoverflow.com/a/16120218/276451 – José Tomás Tocino Jul 10 '15 at 01:26
  • Thanks, all I had to do was add a ".encode('utf-8')" at the end of the print statement. – alphamonkey Jul 10 '15 at 01:41
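
(The error in these comments comes from Python 2 trying to print u'\x96' to a cp1252 console: that code point has no cp1252 mapping, while UTF-8 can encode any code point, which is why the .encode('utf-8') fix works. A minimal stdlib-only reproduction:)

```python
text = u"\x96"  # the character from the traceback

# cp1252 (the console encoding from the traceback) cannot represent it
try:
    text.encode("cp1252")
    ok = True
except UnicodeEncodeError:
    ok = False

print(ok)                    # False
print(text.encode("utf-8"))  # b'\xc2\x96' -- UTF-8 always succeeds
```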

Seemingly, this chart table is dynamically generated. Using PhantomJS, everything works:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://www.boxofficemojo.com/movies/?page=weekend&id=jurassicpark4.htm')
soup = BeautifulSoup(driver.page_source, "lxml")
soup.select('table.chart-wide tbody tr td font')

Out[1]:

[<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=date&amp;order=DESC&amp;p=.htm"><b>Date<br>(click to view chart)</br></b></a></font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=rank&amp;order=ASC&amp;p=.htm">Rank</a></font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=wkndgross&amp;order=DESC&amp;p=.htm">Weekend<br>Gross</br></a></font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=perchange&amp;order=DESC&amp;p=.htm">%<br>Change</br></a></font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=theaters&amp;order=DESC&amp;p=.htm">Theaters</a></font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=theaterchange&amp;order=ASC&amp;p=.htm">Change</a> / </font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=avg&amp;order=DESC&amp;p=.htm">Avg.</a></font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=todategross&amp;order=DESC&amp;p=.htm">Gross-to-Date</a></font>,
<font size="2"><a href="/movies/?page=weekend&amp;id=jurassicpark4.htm&amp;sort=weeknum&amp;order=ASC&amp;p=.htm">Week<br>#</br></a></font>,
<font size="2"><a href="/weekend/chart/?yr=2015&amp;wknd=24&amp;p=.htm"><b>Jun 12–14</b></a></font>,
<font size="2">1</font>,
<font size="2">$208,806,270</font>,
<font size="2">-</font>,
<font size="2">4,274</font>,
...
<font size="2">$500,373,420</font>,
<font size="2">3</font>,
<font size="2"><a href="/weekend/chart/?yr=2015&amp;wknd=27&amp;p=.htm"><b>Jul 3–5</b></a></font>,
<font size="2">2</font>,
<font size="2">$29,242,025</font>,
<font size="2"><font color="#ff0000">-46.4%</font></font>,
<font color="#ff0000">-46.4%</font>,
<font size="2">3,737</font>,
<font size="2"><font color="#ff0000">-461</font></font>,
<font color="#ff0000">-461</font>,
<font size="2">$7,825</font>,
<font size="2">$556,542,980</font>,
<font size="2">4</font>]
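
(To reduce those font cells to plain strings, get_text() on each tag is enough; a small sketch over two cells copied from the output above, parsed statically so it runs without Selenium.)

```python
from bs4 import BeautifulSoup

# Two cells copied from the PhantomJS output above.
html = '<font size="2">1</font><font size="2">$208,806,270</font>'
cells = BeautifulSoup(html, "html.parser").select("font")

print([c.get_text() for c in cells])  # ['1', '$208,806,270']
```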