How can I parse a website using Selenium and Beautifulsoup in python?

Question

New to programming and figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now but not sure where to start. Can someone hold my hand a sec and point me in the right direction?

Any help appreciated.

This isn't a question unfortunately, you should ask something more specific. – jdotjdot, Dec 19 '12 at 20:14
Twitch, if you're really new to Python and programming in general, I'd try working your way through http://learnpythonthehardway.org/ -- based on some of your questions below I think it would help a lot. From there, you'll be able to post more specific (and answerable) questions here. – Amanda, Dec 19 '12 at 21:28

score 159 · Accepted Answer · edited May 31 '22 at 08:22

Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:

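from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://news.ycombinator.com')

# page_source holds the HTML of the page as currently rendered
html = driver.page_source
# name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('title'):
    print(tag.text)

Hacker News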
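
A slightly more defensive variant of the same idea, as a sketch only: it names the parser explicitly and always closes the browser, even if loading fails. The helper names `extract_titles` and `fetch_page_source` are made up for illustration, and Firefox with a working geckodriver is assumed.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the text of every <title> tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("title")]


def fetch_page_source(url):
    """Load url in Firefox and return the rendered HTML (hypothetical helper)."""
    # Imported here so extract_titles() can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # close the browser even if get() raises


if __name__ == "__main__":
    print(extract_titles(fetch_page_source("http://news.ycombinator.com")))
```

Keeping the parsing in its own function means it can be tried on a saved HTML string first, which helps when the script runs without errors but prints nothing.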

edited May 31 '22 at 08:22 by Paul

answered Dec 19 '12 at 20:19 by RocketDonkey

@root Haha, a nice holiday exchange. – RocketDonkey Dec 19 '12 at 20:23
@RocketDonkey - `soup = BeautifulSoup(html)` gives `NameError: name 'html' is not defined`. This is the error I get, any suggestions? – twitch after coffee Dec 19 '12 at 21:05
@twitchaftercoffee So in the code above, `html` refers to the source of the page. Whenever you reach your page, your `driver` object will have an attribute called `page_source`, and the code above assigns that value to `html`. Note that this step isn't really necessary as you could just pass `driver.page_source` directly to BeautifulSoup (as root did above). – RocketDonkey Dec 19 '12 at 21:07
@RocketDonkey - Worked, doesn't toss up errors, but doesn't actually print anything – twitch after coffee Dec 19 '12 at 21:15
@twitchaftercoffee So the example up there looks for a `title` tag, so in the odd case the page doesn't have one then nothing will show. Try running `print soup.prettyify()` - do you see anything? – RocketDonkey Dec 19 '12 at 21:19
Make that `soup.prettify()`... – RocketDonkey Dec 19 '12 at 22:00
@RocketDonkey I want to do the opposite thing. I want to select an element using BeautifulSoup and then perform an action using the Chrome driver. How can I do this? – Rahul Satal Jan 24 '17 at 11:03
`selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH`. – 3kstc Apr 12 '20 at 01:46
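
For the reverse direction asked about in the comments above (pick an element with BeautifulSoup, then act on it through the driver), one possible approach is sketched here. A BeautifulSoup object is only a parsed copy of the page and has no connection to the live browser, so the sketch derives a CSS selector from the parsed tag and lets Selenium locate the element again. The function names `selector_for_link` and `click_link` are hypothetical, and the selector building assumes ids and hrefs without characters that need CSS escaping.

```python
from bs4 import BeautifulSoup


def selector_for_link(html, link_text):
    """Find the first <a> whose text matches and return a CSS selector for it.

    Returns None when no matching link with an href exists.
    """
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        if a.get_text(strip=True) == link_text:
            if a.get("id"):
                return "#" + a["id"]  # prefer the id when the tag has one
            return 'a[href="{}"]'.format(a["href"])
    return None


def click_link(driver, link_text):
    """Locate a link via BeautifulSoup, then click it through the live driver."""
    # Imported here so selector_for_link() works without Selenium installed.
    from selenium.webdriver.common.by import By

    selector = selector_for_link(driver.page_source, link_text)
    if selector is None:
        raise LookupError("no link with text %r" % link_text)
    driver.find_element(By.CSS_SELECTOR, selector).click()
```

With a Chrome driver this would be called as `click_link(driver, "Login")` after `driver.get(...)`, where "Login" is just an example link text.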

score 23 · Answer 2 · edited Dec 19 '12 at 20:36

As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage here on SO.

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://webpage.com')

# name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(browser.page_source, 'html.parser')

# do something useful
# prints all the links with corresponding text
for link in soup.find_all('a'):
    print(link.get('href', None), link.get_text())
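
As a possible extension of the link loop, the sketch below collects the links into a list and resolves relative `href` values against the page URL with the standard library's `urljoin`. The function name `extract_links` and the sample HTML are illustrative assumptions, not part of the answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a href=...> in html."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # skip anchors without an href
        links.append((urljoin(base_url, a["href"]), a.get_text(strip=True)))
    return links


# Made-up sample markup standing in for browser.page_source
sample = (
    '<a href="/news">News</a> '
    '<a name="top">no href</a> '
    '<a href="http://example.com/x">X</a>'
)
for url, text in extract_links(sample, "http://webpage.com"):
    print(url, text)
```

With Selenium, the same function could be fed `browser.page_source` and `browser.current_url` instead of the sample string.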

edited Dec 19 '12 at 20:36

answered Dec 19 '12 at 20:18 by root

+1, didn't see this come up as I was typing :) – RocketDonkey Dec 19 '12 at 20:20
For this, I got `soup=BeautifulSoup(browser.page_source)` `NameError: name 'browser' is not defined` – twitch after coffee Dec 19 '12 at 20:51
the code is ok. `browser=webdriver.Firefox()` defines `browser`. just copy the code directly...you must have made a mistake. – root Dec 19 '12 at 21:08
@root - got it, but did not print anything. Running it outside of python by python xx.py – twitch after coffee Dec 19 '12 at 21:12
`soup=BeautifulSoup(browser.page_source)` it's the same with chrome – root Dec 19 '12 at 21:16
@root I want to do the opposite thing. I want to select an element using BeautifulSoup and then perform an action using the Chrome driver. How can I do this? – Rahul Satal Jan 24 '17 at 11:04

score 2 · Answer 3 · answered Dec 19 '12 at 20:14

Are you sure you want to use Selenium? For this kind of task I used PyQt4; it's very powerful, and you can do whatever you want with it.

Here is some sample code that I just wrote; just change the URL and you're good to go:

#! /usr/bin/env python2.7
# Python 2 / PyQt4 example: renders the page (including JavaScript) in QtWebKit,
# then hands the resulting HTML to BeautifulSoup.

from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
from bs4 import BeautifulSoup
import sys, signal

class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        self.loadProgress.connect(self._progress)
        self.loadFinished.connect(self._loadFinished)
        self.frame = self.page().currentFrame()

    def _progress(self, progress):
        print str(progress) + "%"

    def _loadFinished(self):
        # called once the page has loaded: dump the parsed HTML and close the window
        print "Load Finished"
        html = unicode(self.frame.toHtml()).encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        print soup.prettify()
        self.close()

if __name__ == "__main__":
    # restore the default SIGINT handler so Ctrl+C can stop the Qt event loop
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    app = QApplication(sys.argv)
    br = Browser()
    url = QUrl('http://web site that can contain javascript.com')
    br.load(url)
    br.show()
    sys.exit(app.exec_())

answered Dec 19 '12 at 20:14 by Vor

I have found PyQt4 a humongous pain to use. Depending on OP's requirements, just using BeautifulSoup is probably a lot easier. – jdotjdot Dec 19 '12 at 20:14
What do you mean by "just using BeautifulSoup is probably a lot easier"? – Vor Dec 19 '12 at 20:17
OP here, Beautiful soup allowed me to nav to the section I want to parse very easy. I'd prefer to stick with it if possible. – twitch after coffee Dec 19 '12 at 20:48
I'd love to use pyqt4 instead of selenium - it's so much faster. but when I install it via windows binary - and try and import it and run that code, it can't find the library. Please help – yoshiserry May 19 '14 at 04:52
@Vor I am looking for a way to port my CLI Selenium tool to a GUI. Can an embedded browser control in PyQt be accessed by Selenium? – Volatil3 Jun 16 '16 at 19:19
