
I have to scrape quote data from an insurance company website. This requires filling a form with 12-14 fields, clicking a button, and waiting 14-16 seconds for the results; once the results are available, they are returned through a REST API. I used Python and Selenium: when the API receives a request, the script opens a Chrome browser, fills the UI elements from the API request parameters, and returns the quote once it is available. Everything works for one request at a time, but when I ran a load test with 100 parallel requests, it opened 100 browser windows and crashed, processing only 5 requests.

I have a virtual machine with 200 GB of RAM and a powerful CPU, but it did not work there either. Opening a browser per request seems to be the problem. What alternative or solution can I use, given that our benchmark is 1000 parallel requests at a time? I have done static data scraping in the past and that is an easy job, but dynamic content is where the problems start. If anyone has faced such issues or worked on this before, please share what the best path is. I am thinking of using multiple virtual machines and redirecting requests to them accordingly, but I'm not sure whether that's a good path. There are tools like Microsoft Power Automate, UiPath, etc., but I'm not sure whether they can handle my use case. I have already spent a week creating this scraping script, but from the results it seems there has to be some other path.
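One direction I'm evaluating, to stop the script from opening a browser per request, is a bounded worker pool that caps how many browser instances are alive at once and queues the rest. A rough sketch, where `fetch_quote` is a hypothetical stand-in for my Selenium routine (the real version would drive a headless Chrome instance):

```python
import concurrent.futures

# Hypothetical stand-in for the Selenium routine that fills the
# form, clicks Run Quote, and waits for the result.
def fetch_quote(params):
    return {"premium": 100 + params["age"]}

# Cap on simultaneous browser instances: each headless Chrome can
# consume hundreds of MB, so even a large VM supports far fewer
# than 1000 concurrent windows. Tune this to the machine.
MAX_BROWSERS = 10

def handle_requests(all_params):
    # The executor runs at most MAX_BROWSERS quotes at a time and
    # queues the rest, instead of opening one window per request
    # (which is what crashed the 100-request load test).
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_BROWSERS) as pool:
        return list(pool.map(fetch_quote, all_params))

results = handle_requests([{"age": a} for a in (30, 40, 50)])
print(results)  # [{'premium': 130}, {'premium': 140}, {'premium': 150}]
```

The trade-off is latency rather than crashes: with each quote taking 14-16 seconds, 1000 queued requests against a small pool would take a long time, so this would likely need to be combined with multiple machines or a browser pool that reuses logged-in sessions.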

Guarav T.
  • Can you confirm the website of that insurance company, is it publicly accessible? – Barry the Platipus Oct 13 '22 at 05:03
  • Not sure what exact effect it has on memory, but if you are not already, you can try running Selenium in [headless mode](https://stackoverflow.com/questions/53657215/running-selenium-with-headless-chrome-webdriver) so that it doesn't actually try to display the browser windows, and that might help – Modularizer Oct 13 '22 at 05:33
  • @Modularizer: Thanks for the suggestion. I have also read some articles about headless mode in Selenium, but I am not sure it will really make that big an impact. I mean, if headless mode takes us from 100 to 200 but we fail at 300, then we again need to find something else. Our benchmark is 1000, and we expect at least 300-500 when we launch our marketing. I want to know if there is some better way to do this. – Guarav T. Oct 13 '22 at 09:49
  • @BarrythePlatipus: I cannot share the details of the website, but it has 14 fields, including radio buttons, dropdowns, etc., and when we click the Run Quote button it takes around 10-20 seconds to see the results. Our code already works for one request; the problem here is concurrency. – Guarav T. Oct 13 '22 at 09:52
  • Is that website (which you cannot share) publicly accessible? – Barry the Platipus Oct 13 '22 at 09:53
  • @BarrythePlatipus No, it's behind a login. – Guarav T. Oct 13 '22 at 09:58
  • Please check if there is an API for the POST request. I am pretty sure there is. – kwoxer Oct 14 '22 at 06:38
  • @kwoxer If there were an API, I would not be going through this; insurance companies in the US are stuck in the past. – Guarav T. Oct 14 '22 at 08:14

0 Answers