I'm using the following code to obtain all <script>...</script> content from a webpage (see url in code):

import urllib2
from bs4 import BeautifulSoup

url = "http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")

script = soup.find_all("script")
print script  # just to check the output of script

However, BeautifulSoup searches within the page source (Ctrl+U in Chrome). I want it to search within the element code instead (Ctrl+Shift+I in Chrome), i.e. the DOM as it exists after rendering.

I want it to do this because the piece of code I'm really interested in is in the Element code and not in the Source code.

  • Do you mean you want to query within the code that is in the script tag? – ben_aaron Mar 21 '16 at 11:42
  • Yes, that's an option. I also figured out that I'm going to need to use `json` instead of `BS`, because I want to query in the code that is generated after javascript does its thing. But I still don't know how to do it. However, I could also just obtain the content of ` –  Mar 21 '16 at 11:44
  • OK. Say you want the first `script` tag, you can use `contents` to get the code and then query it with regex, etc. See what happens if you change your last line to `script[0].contents` – ben_aaron Mar 21 '16 at 11:46
  • This still searches within the Source code of the page though, and I want to search in the code that is generated after javascript is done generating page elements. –  Mar 21 '16 at 11:48
  • OK. One approach helping here might be something like [this](http://stackoverflow.com/questions/11047348/is-this-possible-to-load-the-page-after-the-javascript-execute-using-python) and [this tutorial](http://www.kochi-coders.com/2014/05/06/scraping-a-javascript-enabled-web-page-using-beautiful-soup-and-phantomjs/). Hope this helps a bit. – ben_aaron Mar 21 '16 at 11:55
  • Already tried those, I can't get them to work. –  Mar 21 '16 at 12:13
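ben_aaron's suggestion in the comments — take a `script` tag's contents and query it with a regex — can be sketched without the scraping parts. The inline JavaScript and the `"file"` key below are invented for illustration; the real pattern depends on what the page's script actually contains:

```python
import re

# hypothetical script text, as script[0].contents[0] would return it
script_text = """
var player = { "file": "https://example.com/video.mp4", "autoplay": true };
"""

# pull the "file" URL out of the inline JavaScript
match = re.search(r'"file"\s*:\s*"([^"]+)"', script_text)
if match:
    print(match.group(1))  # prints https://example.com/video.mp4
```

This only works on script tags present in the static source; for script tags generated at runtime, see the answer below.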

1 Answer

The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only downloads the initial "static" page - it cannot execute JavaScript the way a real browser does. Hence, you will always get the "View Page Source" content.

To solve your problem - fire up a real browser via selenium, wait for the page to load, get the .page_source and pass it to BeautifulSoup to parse:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")

# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))

# get the page source
page_source = driver.page_source

driver.close()

# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
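Since you mentioned wanting to use `json` on the generated code: once the rendered `page_source` is parsed, an inline JSON blob inside a script tag can be decoded with the standard `json` module. The script text here is invented to show the shape of the approach:

```python
import json
import re

# invented example of a script tag's text containing a JSON config
script_text = 'var config = {"round": 1, "series": "Formula E"};'

# isolate the {...} object, then decode it into a Python dict
json_text = re.search(r'\{.*\}', script_text).group(0)
config = json.loads(json_text)
print(config["series"])  # prints Formula E
```

In practice you would loop over `soup.find_all("script")` and apply this to the tag whose text matches what you are looking for.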

This is the general approach, but your case is a little bit different - there is an iframe element which contains the video player. If you want to access the script elements inside the iframe, you would need to switch to it and then get the .page_source:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")

# wait for the page to load, switch to iframe
wait = WebDriverWait(driver, 10)
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
driver.switch_to.frame(frame)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))

# get the page source
page_source = driver.page_source

driver.close()

# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
alecxe