I'm looking to create a Flask backend that scrapes a website using Selenium and BS4. The API will be called using an arbitrary front-end that can give an input for <link>. I currently have it working using the following code:

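from flask import Flask
from selenium import webdriver
from bs4 import BeautifulSoup

app = Flask(__name__)
options = webdriver.ChromeOptions()  # assumed default options; the original setup is not shown

@app.route('/scrape/<url>', methods=['GET'])
def scrape_site(url):
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    return str(soup)  # return the HTML as a string rather than the BeautifulSoup object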

However, for the kind of pages that I want to scrape, content is rapidly added, but the content resets if you open the page in a new browser. Thus, the page must be opened, waiting must occur, and then the page's content can be scraped. A simple time.sleep() won't work because the page content needs to be scraped multiple times as it's updated, on-demand.

Here's a mock-up of how the app would work: [image]

The user enters a link in the input box, then clicks the light blue submit button and an API call is made to open the site in a Selenium browser. From that point on, the user can click the dark blue scrape site contents button (as many times as they want) and the site's current content should be displayed in the gray box at the bottom. For simplicity the content is represented as the BS4 soup object, but in reality more specific info would be parsed.

TLDR: I want to be able to open the website in Selenium using one API call, and then, with another API call, scrape the site's content using BS4. I'm not sure how to transfer the browser object.

Here's what I have so far:

from flask import Flask
from selenium import webdriver
from bs4 import BeautifulSoup

app = Flask(__name__)
options = webdriver.ChromeOptions()  # assumed default options; the original setup is not shown

@app.route('/open_site/<url>', methods=['GET'])
def open_site(url):
    driver = webdriver.Chrome(options=options)   # local to this function
    driver.get(url)
    return 'site opened!'   # just a confirmation message, nothing need be returned to front-end


@app.route('/scrape_site', methods=['GET'])
def scrape_site():
    html = driver.page_source   # NameError: driver only exists inside open_site
    soup = BeautifulSoup(html, 'lxml')

    return str(soup)

This doesn't work because the driver object is undefined in the second call. Is there a way to pass it from the open_site call to the scrape_site call?

rbb

1 Answer

I'm not sure this is the best way to do it,
but you can keep the driver in a global dictionary defined outside the functions, keyed by a per-user session ID:

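import uuid
from flask import Flask, session
from selenium import webdriver

app = Flask(__name__)
app.secret_key = 'any random string'
options = webdriver.ChromeOptions()  # assumed default options; the original does not define them
user_drivers = {}  # session ID -> WebDriver, shared by all requests in this process

@app.route('/open_site/<url>', methods=['GET'])
def open_site(url):
    user_id = session.get("session-id")
    if user_id is None:
        # Save the new ID in the session so later requests find the same driver
        user_id = str(uuid.uuid4())
        session["session-id"] = user_id

    # Reuse this user's browser if one is already open
    driver = user_drivers.get(user_id)
    if driver is None:
        driver = webdriver.Chrome(options=options)
        user_drivers[user_id] = driver

    driver.get(url)
    return 'site opened!'   # just a confirmation message, nothing need be returned to front-end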
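
The matching scrape step is not shown above. Here is a minimal sketch of the lookup it would need, assuming the same kind of `user_drivers` dictionary keyed by session ID. The helper name `current_html` and the `FakeDriver` stand-in are hypothetical and only there so the snippet runs without a browser. In a real `/scrape_site` route, `user_id` would come from `session.get("session-id")` and the returned HTML would be passed to `BeautifulSoup`.

```python
from typing import Any, Dict, Optional


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver; only page_source is used."""

    def __init__(self, page_source: str) -> None:
        self.page_source = page_source


def current_html(user_drivers: Dict[str, Any], user_id: Optional[str]) -> Optional[str]:
    """Return the current page source of this user's browser, or None if none is open."""
    if user_id is None:
        return None
    driver = user_drivers.get(user_id)
    if driver is None:
        return None
    # page_source is read again on every call, so repeated calls see new content
    return driver.page_source


# Simulate one open browser whose page changes between two scrape requests
user_drivers = {"abc": FakeDriver("<p>first</p>")}
html_before = current_html(user_drivers, "abc")
user_drivers["abc"].page_source = "<p>first</p><p>second</p>"
html_after = current_html(user_drivers, "abc")
html_missing = current_html(user_drivers, "nobody")
```

Returning `None` for an unknown user lets the route answer with an error message instead of raising when `/scrape_site` is called before `/open_site`.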

  • This works in a test environment, but I'm not sure if it would work well when deployed, since (I think) it implies that from the moment I begin hosting my backend, there will be a single Selenium browser always open. I think I'm going to look into using sessions, as described in lhk's response [here](https://stackoverflow.com/questions/32815451/are-global-variables-thread-safe-in-flask-how-do-i-share-data-between-requests). – rbb Jan 04 '22 at 06:18
  • You are right. I thought about that, but wouldn't it open and close too many browsers at the same time? I think you could also run driver-manager code on another thread to manage your driver, opening a new tab instead of a new driver and closing them if they have not been used for some time. – Erfan Ghofrani Jan 04 '22 at 16:24
  • All the code I've seen on stackoverflow for opening multiple tabs on Selenium can only open 2 tabs, and generally multiple tabs isn't supported since Selenium can only focus on one tab at a time. The goal is to have one browser per user; I tried using flask-session but it can't serialize a Webdriver instance so that didn't work. Conceptually flask-session seems like it has what I want, though; going to ask a more specific question to see if there's a way to get around this limitation. – rbb Jan 04 '22 at 17:17
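
The idea from the comments of closing browsers that go unused could be sketched roughly as below, using one driver per user rather than tabs. This is only an illustration under stated assumptions: `DriverRegistry`, `close_idle`, `max_idle_seconds`, and `FakeDriver` are hypothetical names, and the fake driver stands in for `webdriver.Chrome` so the snippet runs without a browser. The only Selenium behavior relied on is that a driver has a `quit()` method.

```python
import threading
import time
from typing import Any, Callable, Dict, List, Tuple


class DriverRegistry:
    """Hypothetical per-user driver store that closes drivers left idle too long."""

    def __init__(self, factory: Callable[[], Any], max_idle_seconds: float) -> None:
        self._factory = factory            # e.g. lambda: webdriver.Chrome(options=options)
        self._max_idle = max_idle_seconds
        self._lock = threading.Lock()      # requests may arrive on several threads
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, user_id: str) -> Any:
        """Return the user's driver, creating it on first use, and refresh its timestamp."""
        with self._lock:
            entry = self._entries.get(user_id)
            driver = entry[0] if entry else self._factory()
            self._entries[user_id] = (driver, time.monotonic())
            return driver

    def close_idle(self) -> List[str]:
        """Quit and forget every driver unused for longer than max_idle_seconds."""
        now = time.monotonic()
        with self._lock:
            stale = [uid for uid, (_, seen) in self._entries.items()
                     if now - seen > self._max_idle]
            expired = [self._entries.pop(uid)[0] for uid in stale]
        for driver in expired:
            driver.quit()                  # ends the browser session
        return stale


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver; only quit() is used."""

    def __init__(self) -> None:
        self.closed = False

    def quit(self) -> None:
        self.closed = True


registry = DriverRegistry(factory=FakeDriver, max_idle_seconds=0.05)
first = registry.get("user-a")
same = registry.get("user-a")      # second request reuses the same driver
time.sleep(0.1)                    # user-a goes idle past the limit
fresh = registry.get("user-b")
closed_ids = registry.close_idle()
```

In a deployed app, `close_idle()` would have to be called periodically, for example from a background thread. Note that a registry like this lives in one process, so it only works if every request from a given user reaches that same process.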