I'm looking to create a Flask backend that scrapes a website using Selenium and BS4. The API will be called from an arbitrary front end that supplies an input for <link>. I currently have it working using the following code:
from flask import Flask
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

app = Flask(__name__)
options = Options()  # configure as needed (e.g. headless mode)

# <path:url> lets the captured segment contain slashes, as full URLs do
@app.route('/scrape/<path:url>', methods=['GET'])
def scrape_site(url):
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    html = driver.page_source
    driver.quit()  # one-shot: close the browser after grabbing the HTML
    soup = BeautifulSoup(html, 'lxml')
    return str(soup)  # Flask expects a string, not a BeautifulSoup object
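For reference, the endpoint can be exercised like this (a minimal sketch; the port and target URL are placeholders, and the requests library stands in for the real front end):

import requests

# Hypothetical local test; depending on the server, the slashes in a full URL
# may need extra handling (e.g. the <path:url> converter used above, or
# percent-encoding on the front end).
resp = requests.get('http://localhost:5000/scrape/https://example.com')
print(resp.text[:300])  # peek at the returned HTML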
However, for the kind of pages I want to scrape, content is rapidly added, but the content resets if you open the page in a new browser. Thus the page must be opened, waiting must occur, and only then can the page's content be scraped. A simple time.sleep() won't work because the page needs to be scraped multiple times, on demand, as it's updated.
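To make the limitation concrete, here's a hypothetical one-shot variant of the handler above (the route name and the fixed 10-second wait are assumptions): it captures exactly one snapshot and then discards the session, so any content the page adds later is unreachable.

import time

from bs4 import BeautifulSoup
from flask import Flask
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
options = Options()

@app.route('/scrape_once/<path:url>', methods=['GET'])
def scrape_once(url):
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(10)             # wait a fixed, arbitrary amount of time...
    html = driver.page_source  # ...then grab a single snapshot
    driver.quit()              # the session, and its accumulated content, is gone
    return str(BeautifulSoup(html, 'lxml'))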
Here's a mock-up of how the app would work:
The user enters a link in the input box, then clicks the light blue submit button, and an API call is made to open the site in a Selenium browser. From that point on, the user can click the dark blue "scrape site contents" button (as many times as they want) and the site's current content should be displayed in the gray box at the bottom. For simplicity the content is represented as the BS4 soup object, but in reality more specific info would be parsed.
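In API terms, the intended interaction might look like this from the caller's side (a sketch only; the requests library, port, and target URL are stand-ins for the real front end):

import requests

BASE = 'http://localhost:5000'  # placeholder for wherever the Flask app runs

# Submit button: open the site once in the Selenium browser.
requests.get(f'{BASE}/open_site/https://example.com')

# Scrape button: poll the live session's current content, as often as desired.
for _ in range(3):
    snapshot = requests.get(f'{BASE}/scrape_site').text
    print(len(snapshot))  # each snapshot may differ as the page updates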
TLDR: I want to be able to open the website in Selenium using one API call and then, with another API call, scrape the site's content using BS4. I'm not sure how to transfer the browser object between the two calls.
Here's what I have so far:
from flask import Flask
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

app = Flask(__name__)
options = Options()  # configure as needed (e.g. headless mode)

@app.route('/open_site/<path:url>', methods=['GET'])
def open_site(url):
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    return 'site opened!'  # just a confirmation; nothing needs to go back to the front end

@app.route('/scrape_site', methods=['GET'])
def scrape_site():
    html = driver.page_source  # fails: driver is local to open_site
    soup = BeautifulSoup(html, 'lxml')
    return str(soup)
This doesn't work because the driver object is undefined in the second call. Is there a way to pass it from the open_site call to the scrape_site call?