
I'm testing the code below.

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
import time

browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")
wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)
url = "https://corp_intranet"
wd.get(url)

# set username
time.sleep(2)
username = wd.find_element_by_id("id_email")
username.send_keys("my_email@corp.com")

# set password
password = wd.find_element_by_id("id_password")
password.send_keys("my_password")


url=("https://corp_intranet")
r = requests.get(url)
content = r.content.decode('utf-8')
print(BeautifulSoup(content, 'html.parser'))

This logs into my corporate intranet fine, but the final print only shows very basic information. Hitting F12 shows me that much of the data on the page is rendered with JavaScript. I did a little research and tried to find a way to grab what I actually see on the screen, rather than a heavily diluted version of it. Is there some way to do a big data dump of all the data that is displayed on the page? Thanks.

ASH
  • This sounds like an [X-Y problem](http://xyproblem.info/). Instead of asking for help with your solution to the problem, edit your question and ask about the actual problem. What are you trying to do? – undetected Selenium Nov 21 '18 at 06:10

2 Answers


You're opening two browsers; delete this line:

browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")

The problem is that in Selenium you're logged in, but not in `requests`, because it uses a different session.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

.....
.....
# missing click on the login button? add "\n" to submit, or click the button explicitly
password.send_keys("my_password\n")

# wait up to 10 seconds until the element with id "theID" appears on the logged-in page
WebDriverWait(wd, 10).until(EC.presence_of_element_located((By.ID, "theID")))

content = wd.page_source
print(BeautifulSoup(content, 'html.parser'))
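
As an aside, if you did want to keep using `requests` after logging in, a minimal sketch (untested, and assuming the `wd` driver from above is already logged in) would be to copy Selenium's session cookies into a `requests.Session`:

import requests

session = requests.Session()
for cookie in wd.get_cookies():
    # each Selenium cookie dict carries at least 'name' and 'value'
    session.cookies.set(cookie['name'], cookie['value'],
                        domain=cookie.get('domain', ''),
                        path=cookie.get('path', '/'))

r = session.get("https://corp_intranet")
print(r.text)

Note that this still only fetches the server's raw HTML; for the JavaScript-rendered content you're after, `wd.page_source` above is the right tool.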
ewwink

You need to have Selenium wait for the webpage to load the additional content, via either implicit or explicit waits.

An Implicit Wait tells the driver to keep polling for up to a fixed amount of time whenever it looks up an element, before giving up.

An Explicit Wait lets you wait for a specific condition, such as a particular element being visible or clickable.

This answer goes into detail on this concept.
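
As a rough illustration (a minimal sketch, reusing the geckodriver path from the question; `some_id` is a placeholder for an element on the rendered page), the two kinds of wait look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")

# implicit wait: every element lookup polls for up to 10 seconds
driver.implicitly_wait(10)

driver.get("https://corp_intranet")

# explicit wait: block until a specific element is visible (up to 10 seconds)
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "some_id"))
)

# the DOM now includes the JavaScript-rendered content
print(driver.page_source)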

Ryan
  • Yes, yes, the page has to be fully loaded, of course, or there is nothing to parse. The problem is, my code is not picking up much from the URL. Also, it seems like the code is opening 2 browsers, instead of just 1 browser. That's weird! – ASH Nov 20 '18 at 22:31
  • I suggested this because the code you posted is just grabbing a url with `requests.get`, which will only ever get the initial HTML of the response. It won't run any javascript to load the rest of the page. It sounds like you need to wait for the javascript to run, and then use something like Selenium's `webdriver.page_source` to get the HTML from the web browser that Selenium opened. – Ryan Nov 20 '18 at 22:52