
I'm trying to scrape data from a message board and have hit a wall I can't seem to get around. I've managed to use Selenium to click through to a page that I want to pull data from but I need to pass it into Beautiful Soup first (or, at least, I think I do). What I can't figure out is how to tell BS that the page I've landed on is the one to transform without explicitly calling a get on the url.

I tried to get around this by defining the url but it still comes back as a None type.

Currently using Python 2.7 on Mac

Here's my current code:

subj = driver.find_element_by_class_name('subject-link').click()
cu = driver.current_url
sub2 = driver.get(cu)
print(sub2) 

My expectation was that the URL would print, but instead it prints "None". My assumption is that if I can get sub2 to hold the page, I'll then be able to build lists of strings for each of these categories:

dads_starts = []
dads_participating = []
dads_messages = []
soupm = BeautifulSoup(subj.content, "lxml")

###Appends dads_starts
for d in soupm.find(class_='disabled-link'):
    dads_starts.append(d.text)

###Appends dads_participating
for d in soupm.findAll(class_='disabled-link'):
    dads_participating.append(d.text)

###Appends dads_messages
for d in soupm.findAll(class_='message-text'):
    dads_messages.append(d.text)
ajbentley
Try passing the HTML of the page Selenium landed on to Beautiful Soup. See below for an example. https://stackoverflow.com/questions/13960326/how-can-i-parse-a-website-using-selenium-and-beautifulsoup-in-python – reticentroot Aug 01 '17 at 22:26
Get the page source using Selenium and pass the content to BeautifulSoup. – Barney Aug 01 '17 at 22:37

1 Answer


Pass `driver.page_source` to BeautifulSoup instead of calling `driver.get()` again. It works fine with the lxml parser as well; the example below uses the built-in html.parser. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

def parse_content(html_doc):
    # BeautifulSoup parses the raw HTML string directly; no extra GET is needed
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.title.text)


driver = webdriver.Chrome()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
# page_source holds the HTML of whatever page the browser is currently on
html_doc = driver.page_source
driver.close()
parse_content(html_doc)
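Applied to the question's code, the same idea means: after the `.click()`, grab `driver.page_source` and hand that string to BeautifulSoup — `.click()` and `driver.get()` both return `None`, which is why `sub2` printed "None". The sketch below simulates this with a small hard-coded HTML fragment (the real board's markup will differ), reusing the class names from the question:

```python
from bs4 import BeautifulSoup

# In the real script this would be: html_doc = driver.page_source
# captured after driver.find_element_by_class_name('subject-link').click()
html_doc = """
<html><body>
  <a class="disabled-link">Dad A</a>
  <a class="disabled-link">Dad B</a>
  <div class="message-text">First message</div>
  <div class="message-text">Second message</div>
</body></html>
"""

soupm = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag, so iterating it yields whole tags
# (note: find() returns a single tag, not an iterable of tags)
dads_participating = [d.text for d in soupm.find_all(class_="disabled-link")]
dads_messages = [d.text for d in soupm.find_all(class_="message-text")]

print(dads_participating)
print(dads_messages)
```
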
Barney