
I am working on a Streamlit app that runs properly locally, but when I initiate the web-scraping process with multiple threads, the website freezes and the process is killed. The console logs indicate the links are being scraped, so I am not sure what is causing the issue. Does anyone have any idea why this is happening?

2023-02-20 19:50:11.308 Get LATEST chromedriver version for google-chrome 110.0.5481
2023-02-20 19:50:11.310 Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/110.0.5481/chromedriver] found in cache


Multithreading function:

    threads = []
    for i in links:
        t = threading.Thread(target=get_links, args=(i, resumeContent))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
Mansidak

1 Answer


You are using `executable_path`, which has been deprecated; you now have to pass in a Service object.

So effectively, instead of:

driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

you need to pass:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

so once the matching ChromeDriver is downloaded, it can be reused.


Update

As per your question update: since you implemented threading, the number of threads you are trying to spawn will always be a concern when you have only 2 GB of memory, because each thread opens its own Chrome instance.
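One way to keep memory bounded is to cap concurrency with a worker pool instead of spawning one thread per link. This is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`; the `get_links` stub below is a hypothetical stand-in for the question's Selenium-backed function, which would otherwise open one Chrome instance per call:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the question's get_links(link, resumeContent).
# The real function opens a Chrome instance per call, which is
# what exhausts memory when every link gets its own thread.
def get_links(link, resume_content):
    return f"scraped {link}"

links = [f"https://example.com/job/{n}" for n in range(10)]
resume_content = "..."

# At most MAX_WORKERS calls (and thus Chrome instances) run at once;
# the remaining links wait in the pool's queue.
MAX_WORKERS = 2
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(lambda l: get_links(l, resume_content), links))
```

Tuning `MAX_WORKERS` to what the instance's memory can hold (perhaps 1-2 on a t2.small) trades speed for stability, rather than the all-or-nothing choice between a plain `for` loop and one thread per link.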

undetected Selenium
  • I see. That makes sense. Something interesting I noticed is that I had a function that used multi threading to run a function. When I changed that to a simple for loop (which takes 10x longer obviously) I didn't see that error anymore. You think the machine kills the process because it's too heavy for the t2.small tier? – Mansidak Feb 20 '23 at 21:38
  • @Mansidak Possibly it does have a domino effect. – undetected Selenium Feb 20 '23 at 21:39
  • I just tried your solution and still getting the same error. I'm suspecting it's the machine. You think it'll be worth it to try upgrading to a higher tier one just for testing? I wonder why despite having a 2gb memory already a simple multi threaded function is freezing the server. – Mansidak Feb 20 '23 at 21:47
  • @Mansidak Ah, _`2gb`_ isn't enough for _multi threading_ at least. – undetected Selenium Feb 20 '23 at 21:49
  • When I ran it on Streamlit, it ran like a breeze, and they only gave me 1GB lol – Mansidak Feb 20 '23 at 21:50
  • Also, I might be mis-using the term multithreading. The code I'm referring to is this where get_links function opens a browser for each link in links : threads = [] for i in links: t = threading.Thread(target=get_links, args=(i, resumeContent)) threads.append(t) t.start() for t in threads: t.join() – Mansidak Feb 20 '23 at 21:51
  • Just edited the question for cleaner reference – Mansidak Feb 20 '23 at 21:52
  • @Mansidak Of course you are using _Threading_, and the number of threads you are trying to spawn will always be a matter of concern provided you have only _2 GB_ of memory. – undetected Selenium Feb 20 '23 at 21:54
  • 1
    Lol just tried upgrading to t2.large with 8gb and it worked. – Mansidak Feb 20 '23 at 22:03