
Guys, I currently have a working script that scrapes AJAX content from a certain page. The thing is that it takes around 12 seconds to run, and for my purposes I need it to be faster.

Any tips?

import time
from bs4 import BeautifulSoup
from selenium import webdriver

def search_char():
    char_name_input = input('Search Character: ')  # user input / character name

    start_time = time.time()
    browser = webdriver.PhantomJS()

    search_url = 'https://www.tibia.com/community/?subtopic=characters'
    browser.get(search_url)  # open the character search page

    # fill in the search form and submit it
    element = browser.find_element_by_name("name")
    element.send_keys(char_name_input)
    browser.find_element_by_name("Submit").click()

    # grab the rendered AJAX content and parse it with BeautifulSoup
    page = browser.find_element_by_id('Content')
    rendered_page = page.get_attribute('innerHTML')
    soup = BeautifulSoup(rendered_page, 'html.parser')

    lista = [item.get_text() for item in soup.find_all('td')]

    browser.close()

    print("--- %s seconds ---" % (time.time() - start_time))

    for i in lista:
        print(i, '\n')

search_char()
Ton

  • Make webscraping faster: use an API. 12 seconds is a fantastic amount of time for UI automation to execute. I frequently run scripts that take anywhere from 1 minute (minimum) to 15 minutes max. Rendering a browser and HTML content on a page requires response times from the website you are automating -- Selenium / Python is actually the fastest way to accomplish UI automation. If you want more speed, use `requests`. – CEH Jan 20 '20 at 16:45
  • HTMLUnit is probably a little faster... cURL would be good for direct requests. Or even Postman? – pcalkins Jan 20 '20 at 20:30
  • @Christine - "Selenium / Python is the fastest" is a bold statement. I wonder if you've tried Puppeteer. – pguardiario Jan 21 '20 at 02:44
  • I’d love to give it a try! I work with C# often so the added speed is novel for me. – CEH Jan 21 '20 at 05:18
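
For reference, a minimal sketch of CEH's suggestion to use `requests` directly, assuming the character search form simply submits its name field back to the same URL (worth confirming the method and field names in your browser's network tab):

import requests
from bs4 import BeautifulSoup

def search_char_fast(char_name):
    # assumption: the form posts a 'name' field to the same URL;
    # verify the actual request in your browser's dev tools
    search_url = 'https://www.tibia.com/community/?subtopic=characters'
    r = requests.post(search_url, data={'name': char_name})
    soup = BeautifulSoup(r.text, 'html.parser')
    return [td.get_text() for td in soup.find_all('td')]

print(search_char_fast('SomeCharacter'))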

2 Answers


I have a few tips:

  • Switch to headless Chrome; it will be faster.

  • Set capabilities.pageLoadStrategy to "none" and use WebDriverWait / EC to wait on the elements you need. This way the script can continue before the whole page loads (see the sketch after this list).

  • Always use CSS selectors instead of name / id / XPath.

  • send_keys is slow; set those values with JavaScript instead.

  • You don't need BeautifulSoup; here's an example of how to get those td texts:

lista = browser.execute_script(" return [...document.querySelectorAll('#Content td')].map(s => s.innerText) ")
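
Putting several of these tips together, a minimal sketch (Selenium 4 style; assumes Chrome with a matching chromedriver, and reuses the field names from the question):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless')
options.page_load_strategy = 'none'  # return control before the full page loads

browser = webdriver.Chrome(options=options)
browser.get('https://www.tibia.com/community/?subtopic=characters')

wait = WebDriverWait(browser, 10)
name_field = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[name='name']")))

# set the value with JavaScript instead of send_keys
browser.execute_script("arguments[0].value = arguments[1]", name_field, 'SomeCharacter')
browser.find_element(By.CSS_SELECTOR, "[name='Submit']").click()

wait.until(EC.staleness_of(name_field))  # wait until the form page is replaced
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#Content td')))
lista = browser.execute_script(
    "return [...document.querySelectorAll('#Content td')].map(s => s.innerText)")
browser.quit()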

I expect you to cut the time in half if you do all of these, and to get it lower still if you switch to Puppeteer.

pguardiario

To start with, if you are dealing with a webpage whose elements are JavaScript-rendered or AJAX-based, there is no ready-made solution for scraping the contents faster. However, with respect to your code snippet, here are a couple of suggestions:

  • If your use case involves invoking click() or send_keys(), always induce WebDriverWait for element_to_be_clickable(), as follows:
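
A sketch of that wait for the script above, reusing the question's browser and char_name_input (assumed already defined):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.NAME, "name"))).send_keys(char_name_input)
wait.until(EC.element_to_be_clickable((By.NAME, "Submit"))).click()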

You can find a detailed discussion in How to click on an element through Selenium Python

  • If your use case involves invoking get_attribute('innerHTML'), always induce WebDriverWait for visibility_of_element_located(), as follows:
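
Again a sketch, with the same imports as above and the question's locator:

page = WebDriverWait(browser, 10).until(
    EC.visibility_of_element_located((By.ID, "Content")))
rendered_page = page.get_attribute('innerHTML')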

You can find a detailed discussion in Python + Selenium: Wait until element is fully loaded

Locator Strategies

  • There is some difference in performance between CSS selectors and XPath. A few takeaways (see the short illustration after this list):
    • For starters there is no dramatic difference in performance between XPath and CSS.
    • Traversing the DOM in older browsers like IE8 does not work with CSS but is fine with XPath. And XPath can walk up the DOM (e.g. from child to parent), whereas CSS can only traverse down the DOM (e.g. from parent to child). However not being able to traverse the DOM with CSS in older browsers isn't necessarily a bad thing as it is more of an indicator that your page has poor design and could benefit from some helpful markup.
    • An argument in favor of CSS selectors is that they are more readable, brief, and concise, though that is a subjective call.
    • Ben Burton mentions you should use CSS because that's how applications are built. This makes the tests easier to write, talk about, and have others help maintain.
    • Adam Goucher says to adopt a more hybrid approach -- focusing first on IDs, then CSS, and leveraging XPath only when you need it (e.g. walking up the DOM) and that XPath will always be more powerful for advanced locators.
    • You can find a detailed discussion in Why should I ever use CSS selectors as opposed to XPath for automated testing?
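
As a quick illustration, the question's input field located three equivalent ways (the CSS and XPath expressions are assumptions about the page's markup):

from selenium.webdriver.common.by import By

browser.find_element(By.NAME, "name")                        # by name attribute
browser.find_element(By.CSS_SELECTOR, "input[name='name']")  # CSS selector
browser.find_element(By.XPATH, "//input[@name='name']")      # XPath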

Reference

You can find a relevant detailed discussion in How to speed up Java Selenium Script, with minimum wait time

undetected Selenium