I am trying to parallelize a loop that retrieves data from a website using Selenium. The loop iterates over a list of URLs, URLlist, that I created beforehand.

First I log in to the page, which creates an instance of the webdriver.

from selenium import webdriver

browser = webdriver.Chrome(executable_path='chromedriver.exe')
browser.get('https://somepage.com')
username = browser.find_element_by_id("email")
password = browser.find_element_by_id("password")
username.send_keys("foo@bar.com")
password.send_keys("pwd123")
browser.find_element_by_id("login-button").click()

Then my loop starts and calls some functions which operate on the page.

for url in URLlist:
    browser.get(url)
    data1 = do_stuff()
    data2 = do_other_stuff()

I don't quite know where to start, because I imagine I need a separate webdriver instance for each thread.

What is the right (and maybe easiest) way to do this?

marc_s
Ramona
  • Possible duplicate of [Python parallel execution with selenium](https://stackoverflow.com/questions/42732958/python-parallel-execution-with-selenium) – Mate Mrše Apr 25 '19 at 10:23

2 Answers

You'll need to put your test methods in a separate .py file, install the pytest package together with the pytest-xdist plugin (which provides the -n option) and the pytest-html plugin (which provides --html), and run the file through pytest. From cmd, try something along these lines:

python -m pytest -n 3 C:\test_file.py --html=C:\Report.html

Here -n 3 starts 3 worker processes, so up to 3 test methods run in parallel.

anish

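To make parallelizing the scraping easier you can install numpy. It is not strictly required; it is used here only for np.array_split, which splits the list of links into roughly equal parts.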

python -m pip install numpy

With that done you can easily achieve what you want. Here is a simple example:

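import threading
import numpy as np
from selenium import webdriver

# List to keep track of the threads
threads = []

threadCount = 5  # Number of threads you want

# Custom thread class; each thread works through its own part of the links
class DoStuffThread(threading.Thread):
    def __init__(self, partLinks):
        threading.Thread.__init__(self)
        self.partLinks = partLinks

    def run(self):
        # New browser instance for each thread
        # (if the pages require a login, log in here for this browser)
        browser = webdriver.Chrome(executable_path='chromedriver.exe')
        try:
            for link in self.partLinks:
                browser.get(link)
                # doStuff/doOtherStuff are placeholders for your own functions;
                # they must use this thread's browser, not a shared global one
                doStuff(browser, link)
                doOtherStuff(browser, link)
        finally:
            browser.quit()  # close the browser even if a page fails

# links is your list of URLs (URLlist in the question)
# Split the links to give each thread a part of them
for partLinks in np.array_split(links, threadCount):
    t = DoStuffThread(partLinks)
    threads.append(t)
    t.start()

# Wait until all threads are finished
for t in threads:
    t.join()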
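
If you also want the results back in the order of the URLs, the same one-browser-per-thread idea can be sketched with the standard library's ThreadPoolExecutor, without numpy. This is only a sketch under assumptions: make_browser, scrape_all and handle_page are hypothetical names, and the default factory assumes chromedriver can be found on your PATH.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_browser():
    """Hypothetical factory: start Chrome and log in, as in the question."""
    from selenium import webdriver  # imported lazily; requires selenium

    browser = webdriver.Chrome()  # assumes chromedriver is discoverable
    # ... perform the login steps from the question here ...
    return browser


def scrape_all(urls, handle_page, workers=3, factory=make_browser):
    """Visit every URL using one browser per worker thread.

    handle_page(browser) is called after each page load; its return
    values are collected in the same order as urls.
    """
    local = threading.local()  # holds one browser per worker thread
    browsers = []              # every browser created, so all can be closed
    lock = threading.Lock()

    def work(url):
        if not hasattr(local, "browser"):
            local.browser = factory()
            with lock:
                browsers.append(local.browser)
        local.browser.get(url)
        return handle_page(local.browser)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(work, urls))
    finally:
        for b in browsers:
            b.quit()  # close all browsers, even after an error
```

A call could then look like scrape_all(URLlist, lambda b: b.title, workers=5), with your own function in place of the lambda.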
Abrogans