Why are requests and urllib2 missing some text from webpages?

Question

The following code fetches the page's HTML with both libraries:

from BeautifulSoup import BeautifulSoup
import requests
import urllib2

url = 'http://www.surfline.com/surf-report/rincon-southern-california_4197/'

source_code = requests.get(url)
plain_text = source_code.text
print plain_text

site = urllib2.urlopen(url).read()
print site

Both libraries' results include:

<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>

Unfortunately, this differs from what the actual webpage shows in a browser:

<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;">4-5ft</div>

4-5ft is not present and therefore cannot be extracted by BeautifulSoup.
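
To make the difference concrete, here is a minimal sketch that pulls the text out of the `current-surf-range` div in both versions of the markup shown above. It uses Python 3 and the standard library's `html.parser` as a stand-in for BeautifulSoup so that it has no third-party dependencies; the names `DivTextExtractor` and `div_text` are made up for this illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside <div> elements that carry a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while we are inside the target div
        self.chunks = []    # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested divs so we know which </div> closes the target.
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the stripped text inside the div with id == target_id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


STYLE = "font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"

# What requests / urllib2 return: the div exists but is still empty.
raw_html = '<div id="current-surf-range" style="%s"></div>' % STYLE

# What the browser shows after the page's scripts have filled the div in.
rendered_html = '<div id="current-surf-range" style="%s">4-5ft</div>' % STYLE

print(repr(div_text(raw_html, "current-surf-range")))       # ''
print(repr(div_text(rendered_html, "current-surf-range")))  # '4-5ft'
```

The parser behaves the same on both inputs; the value is simply absent from the HTML that the HTTP libraries receive.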

Probably the data is loaded asynchronously after the `HTTP/1.1 200` response is sent back. PS. crawling data from websites is not always legal, check the licenses for the published data or look for a REST service providing similar data. — tuned, Jan 19 '16 at 19:19
`requests` and `urllib2` are never going to execute the JavaScript. But I can show you solution in `selenium`. — George Petrov, Jan 19 '16 at 19:56

score 1 · Answer 1 · answered Jan 19 '16 at 20:47

Install Selenium; full instructions are in the docs.

pip3 install selenium

Download the browser driver. I prefer ChromeDriver, but if you have Firefox installed, the code below should work fine.

from selenium import webdriver

url = 'http://www.surfline.com/surf-report/rincon-southern-california_4197/'
web = webdriver.Firefox()
# web = webdriver.Remote('http://localhost:9515', desired_capabilities=DesiredCapabilities.CHROME)

web.get(url)  # get() returns None; the loaded page is available on the driver itself
# Sometimes the page takes time to load, hence: from time import sleep; sleep(2)
plain_text = web.page_source  # HTML after the browser has run the page's JavaScript
web.quit()  # close the browser when done
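
In Selenium's Python bindings, `get()` returns `None`, and the rendered HTML is read from the driver object as `page_source`. Instead of sleeping for a fixed time, one option is to poll until the element of interest has content. The sketch below is a hand-rolled illustration of that idea in Python 3, not Selenium's own mechanism (Selenium ships `WebDriverWait` for this purpose); the function name `page_source_when_ready` and its parameters are hypothetical. It accepts any driver-like object, so a single browser instance can be reused across many pages.

```python
import time


def page_source_when_ready(driver, url, is_ready, timeout=10.0, poll=0.5):
    """Load `url` in `driver` and return `driver.page_source` once
    `is_ready(driver)` returns a truthy value.

    Raises TimeoutError if the page is not ready within `timeout` seconds.
    """
    driver.get(url)  # the return value is None; page state lives on the driver
    deadline = time.monotonic() + timeout
    while True:
        if is_ready(driver):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError("page not ready after %.1f s: %s" % (timeout, url))
        time.sleep(poll)


# Possible usage with a real browser (assumes Selenium 4 naming and an
# installed driver; not verified against the live site):
#
#   from selenium import webdriver
#   from selenium.webdriver.common.by import By
#
#   web = webdriver.Firefox()
#   try:
#       html = page_source_when_ready(
#           web,
#           'http://www.surfline.com/surf-report/rincon-southern-california_4197/',
#           lambda d: d.find_element(By.ID, 'current-surf-range').text.strip(),
#       )
#   finally:
#       web.quit()
```

Passing the driver in, rather than creating one per call, avoids paying the browser start-up cost for every page when scraping several URLs.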

answered Jan 19 '16 at 20:47

George Petrov


I used `web = webdriver.Chrome()` instead.. Unfortunately, I get the error: _AttributeError: 'NoneType' object has no attribute 'page_source'_ with multiple sleep lengths. Also, it seems unreasonable to open a browser page and wait for it to load when scraping multiple pages. Similar issue [here](http://stackoverflow.com/questions/7861775/python-selenium-accessing-html-source) – boogie_bullfrog Jan 23 '16 at 16:59
