I already have a general understanding of how multiprocessing and multithreading can speed up a program:

  • Multiprocessing is used for CPU-bound tasks
  • Multithreading is used for network-bound (I/O-bound) tasks (a minimal sketch of both patterns follows)
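
To make the distinction concrete, here is a minimal sketch of both patterns using only the standard library (the fetch and parse functions are placeholders, not my actual scraper):

```python
# Minimal sketch: threads for network-bound work, processes for CPU-bound work.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import urllib.request

def fetch(url: str) -> str:
    # I/O-bound: the thread mostly waits on the network, so many threads can
    # overlap their waiting time.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def parse_page(html: str) -> dict:
    # CPU-bound: pure-Python parsing benefits from separate processes, since
    # each process has its own interpreter and GIL.
    return {"length": len(html)}

if __name__ == "__main__":
    urls = ["https://example.com"] * 4

    with ThreadPoolExecutor(max_workers=4) as pool:    # network-bound step
        pages = list(pool.map(fetch, urls))

    with ProcessPoolExecutor(max_workers=4) as pool:   # CPU-bound step
        print(list(pool.map(parse_page, pages)))
```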

What if the task I am performing is both CPU-bound and network-bound?

My project is a Selenium web scraper that cycles through a list of keywords, searching each one on Amazon. After searching for a keyword, it extracts the contents of every product on the first page (title, price, reviews, shipping methods, etc.) and writes those contents to an Excel document.

I have run into some major bottlenecks in this project:

  • There are 3,500+ keywords I need to scrape every day, and I can get through roughly one keyword every 12 seconds using a single thread and a single process. This needs to be sped up; however, I seem to have maxed out my CPU and RAM while running the program (i5 and 16 GB). Since my usage is already maxed out, would adding threads or processes help efficiency?
  • A major time component on the CPU side is parsing each product's contents and placing them in the correct column of my Excel document. Amazon does not make its site easy to scrape, so it is hard to find a consistent pattern in the HTML to pull from. Instead of pulling multiple small elements from each product (title, price, reviews, etc.), I resorted to one big pull that captures all product contents, then built an algorithm that parses through all the information and uploads it to the correct spot in the Excel document (a rough sketch of that write-out step is below).
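
For reference, a rough sketch of the "collect everything, then write once" shape I mean, assuming pandas with openpyxl installed (the field names and parsing here are placeholders, not my real code):

```python
# Minimal sketch: accumulate parsed rows in memory, then do a single Excel write.
import pandas as pd

rows = []
for keyword in ["keyword 1", "keyword 2"]:               # stand-in for the 3,500+ keywords
    raw_blocks = ["Some Product | $19.99 | 4.5 stars"]   # stand-in for the "one big pull"
    for block in raw_blocks:
        title, price, reviews = [part.strip() for part in block.split("|")]
        rows.append({"keyword": keyword, "title": title,
                     "price": price, "reviews": reviews})

# One DataFrame build and one write per run, instead of many small cell updates.
pd.DataFrame(rows).to_excel("products.xlsx", index=False)
```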

The majority of the run time seems to be spent parsing the information through that algorithm and uploading it to the Excel document. Keeping in mind that my CPU and RAM usage is maxed out, would multithreading or multiprocessing do anything to increase efficiency?

Note: I can provide a code example, but for simplicity I left it out. I realize the easy answer may be: "upload to a server" but I wanted to use that as a last resort.

Luke Hamilton
  • The answer is no... but it does sound like you're doing too much at one time. You can gather your data from Amazon and then write it out after the driver has closed. (Determine how much data your RAM can take first... but 16GB should be plenty to hold a large dataset.) Use one Excel file for each run. This will probably speed up your Excel export algorithm. I would then import each Excel sheet into an SQL database. So break it down into 3 tasks run at different times. Web browsers take up a lot of resources, so it's best to do data-output tasks after the driver is closed. – pcalkins Oct 26 '21 at 21:41
  • @pcalkins Thank you for your response! I have a database that this will feed into, I am simply using excel as an intermediary. After the program finishes running, it will upload the excel file to the database then clear the excel file contents. Is that the best way to upload? I will try your suggestion on compiling the data AFTER the browser is closed. – Luke Hamilton Oct 26 '21 at 21:54
  • @pcalkins Currently I am transitioning to the next URL without clearing my browser, which may be another reason why it is overloaded, since it stores those cookies! However, it takes around 12 seconds for the first keyword and around 12 seconds for the nth keyword (n being 200 from a test simulation). It seems like I may have to transition my efforts over to a server, unless you have any more helpful insight. – Luke Hamilton Oct 26 '21 at 21:58
  • cookies and caches are written to disk, so I don't think that has much of an effect. If you are only running one webdriver it shouldn't be maxing out your CPU and memory though. Not sure why that's happening. – pcalkins Oct 26 '21 at 22:05
  • One thing you might consider is just writing directly to the database. Doesn't seem like you ever need to write out an excel file. Just go from memory (array) to database after each chunk of data is grabbed. – pcalkins Oct 26 '21 at 22:08
  • Have you attempted scraping with Scrapy or requests? They are far lighter on resources. And what exactly are you scraping? Even with Selenium, > 1000 pages an hour with < 1 GB on one core is normal. Also, saving into a pandas DataFrame and then exporting that to Excel might help. – karel van dongen Oct 26 '21 at 22:18
  • cURL could be an option too... but I'm assuming you have to use Selenium here. I think the key to optimizing memory usage here would be to only get/store the data you need. (So just use the webelement methods to get what it is you want... don't store/pass that around... and especially don't do that in a bunch of different threads...) Those methods will only be valid while you are on a particular page (or particular instance of the DOM) anyway. – pcalkins Oct 26 '21 at 23:04
  • @karelvandongen I have looked into using Scrapy; it has extraordinary speed, however its ability to parse through HTML (from my understanding) is limited. Also, would you happen to have a framework that would set me up for running Scrapy outside of the shell? – Luke Hamilton Oct 27 '21 at 17:17
  • @pcalkins I do not need to use Selenium; what other options would you recommend for this particular scenario? It needs to keep efficiency in mind, as well as the ability to parse through a webpage's HTML and remain undetected without IP cycling (i.e. I want to deploy this scraper for free and not have any hassle with getting blacklisted). – Luke Hamilton Oct 27 '21 at 18:14
  • @Luke Hamilton HTML is never a problem for Scrapy; JavaScript sometimes is, and with https://www.scrapingbee.com/blog/scrapy-javascript/ or https://pypi.org/project/splash/ you can do a lot. You can test the scraping with the Scrapy shell, and proxies are not a problem; combine that with slowing down the requests a bit (built in). For a more graphical interface, see https://www.zyte.com. – karel van dongen Oct 27 '21 at 18:29
  • you might check out Pandas as well: https://pandas.pydata.org/docs/index.html – pcalkins Oct 27 '21 at 20:52

1 Answer

WebDriver is not thread-safe. That said, you can still share a reference to the underlying WebDriver instance across more than one thread if you serialise access to it, but this is not advisable. What you can always do instead is instantiate one WebDriver instance for each thread.

Ideally, the issue of thread safety isn't in your code but in the actual browser bindings. They all assume there will only be one command at a time, just as when simulating a real user. On the other hand, you can always instantiate one WebDriver instance per thread, which will launch multiple browser tabs/windows. Up to this point your ideas seem fine.
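
For illustration, a minimal sketch of the one-WebDriver-per-thread pattern, assuming Chrome and a placeholder scrape_keyword() step (not your actual extraction logic):

```python
# Minimal sketch: each worker thread creates and reuses its own WebDriver,
# so no instance is ever shared between threads.
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote_plus

from selenium import webdriver

thread_local = threading.local()

def get_driver() -> webdriver.Chrome:
    # Lazily create one browser per thread and reuse it for later keywords.
    if not hasattr(thread_local, "driver"):
        thread_local.driver = webdriver.Chrome()
    return thread_local.driver

def scrape_keyword(keyword: str) -> str:
    driver = get_driver()
    driver.get("https://www.amazon.com/s?k=" + quote_plus(keyword))
    return driver.title  # placeholder for the real per-page extraction

if __name__ == "__main__":
    keywords = ["usb cable", "water bottle", "desk lamp"]
    with ThreadPoolExecutor(max_workers=3) as pool:
        print(list(pool.map(scrape_keyword, keywords)))
```

In a real run you would also need to quit each thread's driver once the pool has finished, since the sketch above never closes the browsers.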

Now, different threads can be run against the same WebDriver, but then the results of the tests would not be what you expect. The reason is that running different tests on different tabs/windows requires a little thread-safety coding; otherwise, actions such as click() or send_keys() will go to whichever opened tab/window currently has focus, regardless of the thread you expect them to be running in. This essentially means all the tests will run simultaneously on the tab/window that has focus, not on the desired tab/window.

However, a viable solution may be to use remote.webdriver, which is the abstract base class for all WebDriver subtypes. An abstract base class allows custom implementations of WebDriver to be registered so that isinstance type checks will succeed.
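
For illustration, a minimal check along those lines, assuming a local Chrome driver:

```python
# Minimal sketch: every local driver (Chrome, Firefox, ...) is a subtype of
# the remote WebDriver base class, so isinstance checks against it succeed.
from selenium import webdriver
from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver

driver = webdriver.Chrome()
print(isinstance(driver, RemoteWebDriver))  # True
driver.quit()
```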

undetected Selenium
  • Does thread safety have anything to do with speed? Also, I have updated this code since the post by using pandas instead of xlwings (much less memory usage), and I am able to pull, on average, one entire product page every 1.6 seconds using multithreading. Seeing how I resolved the majority of my CPU/RAM constraints by switching over to pandas, wouldn't it be best to use multithreading instead of multiprocessing, since it is an I/O-bound task? Just wondering, since you discussed both of them, do you know the difference between them? – Luke Hamilton Nov 28 '21 at 05:31
  • I will look into remote webdrivers, but again, what does this have to do with efficiency? Efficiency is the bottom line here. As always, I appreciate your help! – Luke Hamilton Nov 28 '21 at 05:33
  • Apparently it's true that _thread safety does have something to do with speed_. But when you speak of multithreading and/or multiprocessing with Selenium, those are the experiences we have gathered over time. – undetected Selenium Nov 28 '21 at 06:47
  • If you still have confusion feel free to speak to me within the [Selenium](https://chat.stackoverflow.com/rooms/223360/selenium) room. – undetected Selenium Nov 28 '21 at 18:19