
I'm trying to scrape a JavaScript website using Scrapy and Selenium. I open the JavaScript website with Selenium and a Chrome driver, scrape all the links to the different listings on the current page with Scrapy, and store them in a list (this has been the best way to do it so far, as trying to follow links with SeleniumRequest and calling back to a parse-new-page function caused a lot of errors). Then I loop through the list of URLs, open each one in the Selenium driver, and scrape the info from the page. So far this scrapes about 16 pages/minute, which is not ideal given the number of listings on this site. I would ideally have the Selenium drivers opening links in parallel, like the following implementations:

How can I make Selenium run in parallel with Scrapy?

https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b

However, I can't figure out how to implement parallel processing in my Selenium-Scrapy code.

    import scrapy
    import time
    from scrapy.selector import Selector
    from scrapy_selenium import SeleniumRequest
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    class MarketPagSpider(scrapy.Spider):
        name = 'marketPagination'

        def start_requests(self):
            yield SeleniumRequest(
                url="https://www.cryptoslam.io/nba-top-shot/marketplace",
                wait_time=5,
                wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
                callback=self.parse
            )

        responses = []

        def parse(self, response):
            # initialize driver
            driver = response.meta['driver']
            driver.set_window_size(1920, 1080)

            time.sleep(1)
            WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
            )

            # collect the listing links from the rendered table
            rows = response.xpath("//tbody/tr[@role='row']")
            for row in rows:
                link = row.xpath(".//td[4]/a/@href").get()
                absolute_url = response.urljoin(link)

                self.responses.append(absolute_url)

            # visit each listing sequentially with the same driver
            for resp in self.responses:
                driver.get(resp)
                html = driver.page_source
                response_obj = Selector(text=html)

                yield {
                    'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                    'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
                }

I know that scrapy-splash can handle multiprocessing, but the website I'm trying to scrape doesn't open in Splash (at least I don't think it does).
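
For reference, a minimal sketch of what the scrapy-splash version of the first request would roughly look like (assuming a running Splash instance and the scrapy-splash middlewares configured in settings.py; this isn't code from my project):

    from scrapy_splash import SplashRequest

    def start_requests(self):
        # hypothetical scrapy-splash equivalent of the SeleniumRequest above;
        # requires SPLASH_URL and the scrapy-splash middlewares in settings.py
        yield SplashRequest(
            url="https://www.cryptoslam.io/nba-top-shot/marketplace",
            callback=self.parse,
            args={'wait': 5},  # give the JS table time to render
        )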

I've also deleted the pagination code to keep the example concise.

I'm very new to this and open to any suggestions or solutions for multiprocessing with Selenium.

Enderh3art
  • Post your multiprocessing code; it works as usual, but each thread/process should use its own driver – Wonka Feb 05 '21 at 09:21
  • @Wonka I'm not really sure how to implement that. I'm very unfamiliar with the multiprocessing library in general, I apologize – Enderh3art Feb 05 '21 at 18:44
  • See [this question](https://stackoverflow.com/questions/53475578/python-selenium-multiprocessing) for the basic technique: the accepted answer, and my (Booboo) answer, which ensures that the drivers terminate when you are done. The accepted answer is a technique that uses one driver per thread instead of one driver per URL. In other words, it reuses the drivers just as you reuse your driver for all the URLs in your non-threading code. – Booboo Feb 06 '21 at 12:25
  • @Booboo Hey, thanks for your answer! I managed to get Selenium to multiprocess like your solution. However, I can't seem to delete the drivers after the script is done, even though I put `del threadLocal` at the end. I actually end up getting this error: NameError: name 'threadLocal' is not defined – Enderh3art Feb 07 '21 at 03:54
  • In the accepted answer is the declaration `threadLocal = threading.local()`. I didn't copy to my answer that required line on the assumption that it was understood. I have now updated the answer to make that declaration explicit. – Booboo Feb 07 '21 at 12:50
  • @Booboo Yeah, I had that in my code, but for some reason I still get the error. I think I might just manually call the destructor method in the driver class – Enderh3art Feb 08 '21 at 20:51
  • Did you include `del threadLocal; import gc; gc.collect()` when you are all done (as in my answer)? That should result in calling the destructors (it did for me, anyway). – Booboo Feb 08 '21 at 21:07
  • @Booboo Yeah, I put that at the end of my code. Is there any way you can show me an example of your code? The multiprocessing works just fine; it's just closing the drivers that causes me problems – Enderh3art Feb 08 '21 at 23:51
  • I posted an answer. – Booboo Feb 09 '21 at 01:13
  • Did the answer I posted help? – Booboo Feb 15 '21 at 12:04
  • @Booboo it actually worked perfectly! I would upvote it but it's not letting me since I'm a newbie – Enderh3art Feb 16 '21 at 01:38

1 Answer


The following sample program creates a thread pool with only 2 threads for demo purposes and then scrapes 4 URLs to get their titles:

    from multiprocessing.pool import ThreadPool
    from bs4 import BeautifulSoup
    from selenium import webdriver
    import threading
    import gc

    class Driver:
        def __init__(self):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            # suppress logging:
            options.add_experimental_option('excludeSwitches', ['enable-logging'])
            self.driver = webdriver.Chrome(options=options)
            print('The driver was just created.')

        def __del__(self):
            self.driver.quit()  # clean up driver when we are cleaned up
            print('The driver has terminated.')


    threadLocal = threading.local()

    def create_driver():
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = Driver()
            setattr(threadLocal, 'the_driver', the_driver)
        return the_driver.driver


    def get_title(url):
        driver = create_driver()
        driver.get(url)
        source = BeautifulSoup(driver.page_source, "lxml")
        title = source.select_one("title").text
        print(f"{url}: '{title}'")

    # just 2 threads in our pool for demo purposes:
    with ThreadPool(2) as pool:
        urls = [
            'https://www.google.com',
            'https://www.microsoft.com',
            'https://www.ibm.com',
            'https://www.yahoo.com'
        ]
        pool.map(get_title, urls)
        # must be done before terminate is explicitly or implicitly called on the pool:
        del threadLocal
        gc.collect()
    # pool.terminate() is called at exit of with block

Prints:

    The driver was just created.
    The driver was just created.
    https://www.google.com: 'Google'
    https://www.microsoft.com: 'Microsoft - Official Home Page'
    https://www.ibm.com: 'IBM - United States'
    https://www.yahoo.com: 'Yahoo'
    The driver has terminated.
    The driver has terminated.
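
To plug this into the spider from your question, one possible adaptation (a sketch only, not part of the answer above; the helper name `scrape_listing` and the pool size of 4 are my own choices) is to map the collected listing URLs over the same kind of thread pool and yield each result, reusing the per-thread drivers from `create_driver()`:

    from multiprocessing.pool import ThreadPool
    from scrapy.selector import Selector

    def scrape_listing(url):
        # create_driver() is the per-thread driver factory defined above
        driver = create_driver()
        driver.get(url)
        sel = Selector(text=driver.page_source)
        return {
            'name': sel.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
            'price': sel.xpath("//span[@class='js-auction-current-price']/text()").get(),
        }

    # inside MarketPagSpider.parse(), after self.responses has been filled:
    # with ThreadPool(4) as pool:
    #     for item in pool.map(scrape_listing, self.responses):
    #         yield item
    #     del threadLocal
    #     gc.collect()

The `del threadLocal` / `gc.collect()` step still has to happen before the pool is terminated so the worker drivers quit cleanly.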
Booboo