The following code fetches a web page's HTML, once with requests and once with urllib2:

from BeautifulSoup import BeautifulSoup
import requests
import urllib2

url = 'http://www.surfline.com/surf-report/rincon-southern-california_4197/'

source_code = requests.get(url)
plain_text = source_code.text
print plain_text

site = urllib2.urlopen(url).read()
print site

Both libraries return HTML that includes:

<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>

Unfortunately, this differs from what the actual webpage shows:

<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;">4-5ft</div>

The value 4-5ft is missing from the fetched HTML, so BeautifulSoup cannot extract it.
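To make the difference concrete, here is a minimal Python 3 sketch that pulls the text out of the two versions of the div quoted above. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the names `TextById` and `text_by_id` are made up for this illustration. The empty result for the fetched HTML suggests the value is filled in after the page loads, presumably by JavaScript, which neither requests nor urllib2 executes.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text found inside the element that has a given id.

    Illustrative helper only. It assumes well-formed markup in which
    every tag inside the target element is closed (no bare <br>).
    """

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []    # text fragments seen inside the target element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, element_id):
    """Return the stripped text of the element with the given id."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


STYLE = "font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"

# What requests / urllib2 receive, as quoted in the question.
fetched = '<div id="current-surf-range" style="%s"></div>' % STYLE
# What the browser shows after the page has finished loading.
rendered = '<div id="current-surf-range" style="%s">4-5ft</div>' % STYLE

print(repr(text_by_id(fetched, "current-surf-range")))   # ''
print(repr(text_by_id(rendered, "current-surf-range")))  # '4-5ft'
```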

CoderPi

1 Answer

  1. Install Selenium; full instructions are in the docs.

pip3 install selenium

  2. Download a browser driver. I prefer ChromeDriver, but if you have Firefox installed, the code below should work fine.

from selenium import webdriver
url = 'http://www.surfline.com/surf-report/rincon-southern-california_4197/'
web = webdriver.Firefox()
# web = webdriver.Remote('http://localhost:9515', desired_capabilities=DesiredCapabilities.CHROME)

web.get(url)  # get() returns None, so do not assign its result
# Pages sometimes take a while to load; if so: from time import sleep; sleep(2)
plain_text = web.page_source  # the rendered HTML is read from the driver itself
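
A fixed sleep can be too short on a slow page and wasteful on a fast one. As a rough sketch, the driver can instead be polled until the div actually contains text. The helpers `wait_for_element_text` and `fetch_surf_range` below are hypothetical names written for this illustration, a hand-rolled stand-in for Selenium's own explicit waits (`WebDriverWait`). They assume only that the driver offers `get()` and `find_element()`, as Selenium WebDriver objects do.

```python
import time


def wait_for_element_text(driver, element_id, timeout=10.0, poll=0.5):
    """Poll until the element with the given id has non-empty text.

    Returns the text, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # "id" is the locator strategy string behind Selenium's By.ID.
            text = driver.find_element("id", element_id).text.strip()
        except Exception:
            # Broad on purpose for this sketch: the element may not exist
            # yet (Selenium raises NoSuchElementException in that case).
            text = ""
        if text:
            return text
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


def fetch_surf_range(driver, url, timeout=10.0):
    """Load the page, then wait for the surf range to be filled in."""
    driver.get(url)  # returns None; the page state lives on the driver
    return wait_for_element_text(driver, "current-surf-range", timeout)


# Usage with a real browser (requires selenium and a browser driver):
#   from selenium import webdriver
#   web = webdriver.Firefox()
#   print(fetch_surf_range(web, url))
#   web.quit()
```

Because the helpers only touch `get()` and `find_element()`, the same driver can be reused across several pages, so the browser is started once rather than once per page.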
George Petrov
  • I used `web = webdriver.Chrome()` instead. Unfortunately, I get the error _AttributeError: 'NoneType' object has no attribute 'page_source'_ regardless of the sleep length. Also, it seems unreasonable to open a browser page and wait for it to load when scraping multiple pages. Similar issue [here](http://stackoverflow.com/questions/7861775/python-selenium-accessing-html-source) – boogie_bullfrog Jan 23 '16 at 16:59