1

I am using selenium and chrome-driver to scrape data from some pages and then run additional tasks with that information (for example, typing comments on some pages).

My program has a button. Every time it is pressed, it calls thread_(self) (below), which starts a new thread. The target function self.main contains the code that runs all the selenium work on a chrome-driver.

def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()

My problem is this: after the user presses the button the first time, the th thread opens browser A and does some work. While browser A is still working, the user presses the button again, which opens browser B running the same self.main. I want every opened browser to run simultaneously. Instead, when I run that thread function a second time, the first browser stops and the second browser opens.

I know my code can create an unlimited number of threads, and I know that this will affect the PC's performance, but I am OK with that. I want to speed up the work done by self.main!

Flu_Py Developer
  • Does this answer your question? [How can I make a function run in the background without blocking my entire program?](https://stackoverflow.com/questions/60778753/how-can-i-make-a-function-run-in-the-background-without-blocking-my-entire-progr) – imbr Aug 30 '21 at 18:20
  • no it doesn't,. – Flu_Py Developer Aug 30 '21 at 18:24
  • but it is close – Flu_Py Developer Aug 30 '21 at 18:25
  • do you want to create Synchronizing Threads with the same thread function with some button ideas so that it can go like multithreading? – Fatin Ishrak Rafi Aug 30 '21 at 18:27
  • yes, sth like this – Flu_Py Developer Aug 30 '21 at 18:29
  • > but how can I open a new thread? you already wrote it `th = threading.Thread(target=self.main) th.start()`. Or it's not clear what you want. Please read https://stackoverflow.com/help/how-to-ask – imbr Aug 30 '21 at 18:30
  • I am trying to open totally new thread while a this thread is already opened – Flu_Py Developer Aug 30 '21 at 18:31
  • look, I am trying to make each browser opened to run simultaneously. – Flu_Py Developer Aug 30 '21 at 18:32
  • why cant you just repeat this piece of code `th = threading.Thread(target=self.main) th.start()`? Again read how to ask. You are not being clear. That's a principle: if you cannot explain what you are trying to do no one will be able to help you properly. Please read how to ask. – imbr Aug 30 '21 at 18:33
  • ok, this is what I want exactly, the user will open 'th' thread, it will open browser 'A' and do some stuff, while browser A is doing some stuff, the user will press the start button again and open browser B and do that same stuff, I want both of them to run together, the problem I faced is that when I run that thread function, the first browser stops and the second browser is opened – Flu_Py Developer Aug 30 '21 at 18:36
  • this is with details, sorry if I made it long – Flu_Py Developer Aug 30 '21 at 18:37
  • Do you get what I mean ? – Flu_Py Developer Aug 30 '21 at 18:38
  • @eusoubrasileiro can you please refer to my detailed question. – Flu_Py Developer Aug 30 '21 at 18:48
  • @Flu_PyDeveloper one more question. What exact task are you doing with selenium chrome webdriver? – imbr Aug 31 '21 at 11:51

2 Answers

1

Threading for selenium speed up

Consider the following functions, which illustrate how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the html title from a page opened by selenium, using BeautifulSoup. The list of pages is links.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
   """returns a new chrome webdriver"""
   chromeOptions = webdriver.ChromeOptions()
   chromeOptions.add_argument("--headless") # make it not visible, just comment if you like seeing opened browsers
   return webdriver.Chrome(options=chromeOptions)  

def get_title(url, driver=None):
   """print the html title of url using BeautifulSoup
   if driver is None, uses a new chrome-driver and quit() it afterwards
   otherwise uses the driver provided and doesn't quit() it"""
   # the parameter is named driver so it does not shadow the selenium webdriver module
   def print_title(driver):
      driver.get(url)
      soup = BeautifulSoup(driver.page_source,"lxml")
      item = soup.find('title')
      print(item.string.strip())

   if driver:
      print_title(driver)
   else:
      driver = create_driver()  # this call owns the driver, so it must quit it
      try:
         print_title(driver)
      finally:
         driver.quit()  # close the browser even if the page fails to load

links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/", 
"https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]

Now call get_title on the links above.

Sequential approach

A single chrome driver handles all links sequentially. This takes 22.3 s on my machine (note: Windows).

start_time = time.time()
driver = create_driver()

for link in links: # could be 'like' clicks 
  get_title(link, driver)  

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")

Multiple threads approach

Using a thread for each link. This takes 10.5 s, more than 2x faster.

start_time = time.time()    
threads = [] 
for link in links: # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start() # could `time.sleep` between 'clicks' to watch what happens when not headless
    threads.append(th)
for th in threads:
    th.join() # main thread waits for all threads to finish
print("multiple threads took ", (time.time() - start_time), " seconds")

This and this (better) are some other working examples. The second uses a fixed number of threads in a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting a new one every time.
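The idea of keeping one driver per worker thread can be sketched roughly as follows. This is a minimal sketch and not the code from the linked examples: `FakeDriver`, `get_driver`, and `fetch_title` are hypothetical names, and `FakeDriver` stands in for a real chrome-driver so that the snippet runs without a browser. With selenium, the assumption is that `create_driver` from above would be passed as the factory instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeDriver:
    """Hypothetical stand-in for a selenium driver (no browser needed)."""
    def get(self, url):
        self.current_url = url
    def quit(self):
        self.current_url = None

thread_local = threading.local()   # each thread sees its own attributes
created = []                       # every driver ever created, for cleanup
created_lock = threading.Lock()

def get_driver(factory=FakeDriver):
    """Return the calling thread's driver, creating it on first use only."""
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = factory()
        thread_local.driver = driver
        with created_lock:
            created.append(driver)
    return driver

def fetch_title(url):
    driver = get_driver()          # reused across tasks on the same thread
    driver.get(url)
    return "visited " + driver.current_url

urls = ["https://example.com/%d" % i for i in range(9)]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch_title, urls))

for driver in created:             # quit each driver once, at the very end
    driver.quit()
```

Nine tasks share at most three drivers here, so the start-up cost of a browser is paid once per worker thread rather than once per link.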

Still, I was not sure this was the optimal approach for getting considerable speed-ups with selenium. Threads running code that is not I/O bound end up executing sequentially (one thread after another): because of the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel on multiple CPU cores.

Processes for selenium speed up

To try to overcome the Python GIL limitation, I wrote the following code using the multiprocessing package and its Process class, and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. Additional code is here.

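import multiprocessing

def run_with_processes(links):
    """run get_title for each link in its own process, return elapsed seconds"""
    start_time = time.time()
    processes = []
    for link in links: # each process is like a new 'click'
        ps = multiprocessing.Process(target=get_title, args=(link,))
        ps.start() # could sleep 1 s between 'clicks' with `time.sleep(1)`
        processes.append(ps)
    for ps in processes:
        ps.join() # main process waits for all processes to finish
    return time.time() - start_time

if __name__ == "__main__": # guard needed on Windows, where child processes re-import this module
    print("multiple processes took ", run_with_processes(links), " seconds")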

Contrary to what I expected, multiprocessing.Process based parallelism for selenium was on average around 8% slower than threading.Thread. But both were on average more than twice as fast as the sequential approach. I have since found out that selenium chrome-driver commands use HTTP requests (like POST, GET), so the work is I/O bound. The Python GIL is therefore released while waiting, which does make the threads run in parallel.
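The effect of the GIL being released during blocking waits can be seen without selenium at all. The snippet below is only an illustration under a stated assumption: `wait_for_browser` is a hypothetical function, and `time.sleep` stands in for the time a thread spends waiting for chrome-driver to answer an HTTP request.

```python
import threading
import time

def wait_for_browser(seconds):
    """Stand-in for a blocking driver command; sleeping releases the GIL."""
    time.sleep(seconds)

start = time.perf_counter()
threads = [threading.Thread(target=wait_for_browser, args=(0.2,)) for _ in range(5)]
for th in threads:
    th.start()
for th in threads:
    th.join()
elapsed = time.perf_counter() - start

# Run one after another, five waits of 0.2 s would need about 1 s in total.
print("5 waits of 0.2 s took %.2f s in threads" % elapsed)
```

Because the waits overlap, the total should be close to a single wait rather than to the sum of all five, which matches the behaviour measured above.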

Threading: a good start for selenium speed up

This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, and multiprocessing has many limitations in this case. Each new Process is not a fork as on Linux, meaning, among other downsides, that a lot of memory is wasted.

Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than the heavier approach of processes (especially for Windows users).

imbr
0

Try this (it requires self.jobs to be a list created beforehand, for example in __init__):

def thread_(self):
    th = threading.Thread(target=self.main)
    self.jobs.append(th)
    th.start()

info: https://pymotw.com/2/threading/
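To see the snippet above in context, here is a minimal runnable sketch. The class `App`, the counters `running` and `max_running`, and the `time.sleep` placeholder are all hypothetical, added only to check that two button presses overlap; the real self.main is not shown in the question.

```python
import threading
import time

class App:
    """Hypothetical stand-in for the program that owns the button."""
    def __init__(self):
        self.jobs = []            # every thread started so far
        self.running = 0          # jobs currently inside main()
        self.max_running = 0      # highest number of simultaneous jobs seen
        self.lock = threading.Lock()

    def main(self):
        # Placeholder for the selenium work. Assumption: a real version
        # would create its own driver here, so no two threads share a browser.
        with self.lock:
            self.running += 1
            self.max_running = max(self.max_running, self.running)
        time.sleep(0.3)
        with self.lock:
            self.running -= 1

    def thread_(self):
        th = threading.Thread(target=self.main)
        self.jobs.append(th)      # keep a reference so the thread can be joined later
        th.start()

app = App()
app.thread_()   # first button press
app.thread_()   # second press while the first job is still running
for th in app.jobs:
    th.join()
print("jobs overlapping at peak:", app.max_running)
```

Keeping the threads in self.jobs does not by itself make them run in parallel; it only makes it possible to join or inspect them later.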