
I have a web page open and logged in using Selenium WebDriver code. I am using WebDriver because the page requires a login and various other actions before I am set to scrape.

The aim is to scrape data from this open page. I need to find links and open them, so there will be a lot of back-and-forth between Selenium WebDriver and BeautifulSoup.

I looked at the bs4 documentation and tried the BeautifulSoup(open("ccc.html")) pattern, but with my URL it throws an error:

soup = bs4.BeautifulSoup(open("https://m/search.mp?ss=Pr+Dn+Ts"))

OSError: [Errno 22] Invalid argument: 'https://m/search.mp?ss=Pr+Dn+Ts'

I assume this is because it's not a .html file?

Sid
  • see [how to get innerHTML of whole page in selenium driver](https://stackoverflow.com/questions/35905517/how-to-get-innerhtml-of-whole-page-in-selenium-driver) – robyschek Jan 23 '17 at 17:26

1 Answer


You are trying to open a page by its web address. The built-in open() only opens local files, so use urlopen() instead:

from urllib.request import urlopen  # Python 3
# from urllib2 import urlopen  # Python 2

import bs4

url = "your target url here"
soup = bs4.BeautifulSoup(urlopen(url), "html.parser")

Or use the requests library ("HTTP for Humans"):

import requests

response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")

Also note that it is strongly advisable to specify a parser explicitly. I've used html.parser here; other parsers, such as lxml and html5lib, are also available.


"I want to use the exact same page (same instance)"

A common way to do it is to get the driver.page_source and pass it to BeautifulSoup for further parsing:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)

# wait for page to load..

source = driver.page_source
driver.quit()  # remove this line to leave the browser open

soup = BeautifulSoup(source, "html.parser")
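Since the goal is to find links and open them, the following is a rough sketch of pulling links out of the parsed source. The function name extract_links and the sample HTML are made up for illustration; in a real run the string would come from driver.page_source.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(page_source, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(page_source, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the URL of the page itself.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Stand-in for driver.page_source so the sketch runs without a browser.
sample_source = (
    '<a href="/item/1">One</a> '
    '<a href="https://example.com/2">Two</a> '
    '<a>No href</a>'
)
print(extract_links(sample_source, "https://example.com/search"))
```

With a live session, something like extract_links(driver.page_source, driver.current_url) followed by driver.get(link) for each result should keep everything inside the same logged-in browser, provided driver.quit() has not been called yet.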
Corey Goldberg
alecxe
  • I think I didn't explain properly, the page is already open. :( I want to use the exact same page (same instance) opened by selenium. In both the examples I assume a new url based request is being made to open/get the data. – Sid Jan 23 '17 at 17:22
  • @Sid alright, I've updated the answer - please see if this is what you've meant. Thanks. – alecxe Jan 23 '17 at 17:25
  • The third one was exactly what I was looking for. :) Thanks – Sid Jan 23 '17 at 17:31