
I have a list of URLs that I want to scrape with a single function. I used the code below, but the problem is that after some time it uses up all the memory in the system, which makes the system crash. I want a solution that uses only 5-10 threads at a time.

    import time
    from threading import Thread

    threads = []
    count = 0  # 'count' must be initialized before it is incremented
    for url in input:  # 'input' is the list of URLs (it shadows the built-in)
        count += 1
        # One thread per URL: with a long list this creates thousands of
        # threads at once, which is what exhausts memory.
        _thread1 = Thread(target=self.hitandsave, args=(url, count))
        _thread1.start()
        threads.append(_thread1)
        time.sleep(0.2)
    for t in threads:
        t.join()

I tried the code above, but I want Python code that iterates through the list using only 5 to 10 threads at a time.

I also tried https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor, but after some time the system hung and the process was killed.
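The usual cause of unbounded memory growth with `ThreadPoolExecutor` is submitting a future for every URL up front: the pool only runs `max_workers` tasks at a time, but every pending future and every finished result stays in memory. Below is a minimal sketch that caps the number of in-flight futures instead; `scrape_all` is just an illustrative name, and `hitandsave` is the method from the snippet above.

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def scrape_all(self, urls, max_workers=5):
        # At most max_workers requests run (and are held in memory) at once.
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            in_flight = set()
            for count, url in enumerate(urls, start=1):
                # Wait for a slot before submitting the next URL, so
                # pending futures never pile up for the whole list.
                if len(in_flight) >= max_workers:
                    done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
                    for future in done:
                        future.result()  # re-raise any exception from the worker
                in_flight.add(executor.submit(self.hitandsave, url, count))
            for future in in_flight:
                future.result()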

Mayuri
  • Let someone else worry about that, because Python multi-threading is a can of worms you don't want to open. Find a scraping tool that takes sensible config or runtime flags that someone else already wrote, and just use that. (Not to mention that Python's thread scheduler can't do concurrency: even if you make 20 threads, it can only run one of them at a time, thanks to Python's [Global Interpreter Lock](https://realpython.com/python-gil/)) – Mike 'Pomax' Kamermans Aug 18 '23 at 06:35
  • @Mike'Pomax'Kamermans Multithreading in Python is perfectly suitable for IO-based tasks like web scraping. No need to use an external tool. – mousetail Aug 18 '23 at 06:41
  • Python can't download a page and then process the data while at the same time downloading the next page, so I'd argue that no, it's not even suitable for (efficient bulk) web scraping. – Mike 'Pomax' Kamermans Aug 18 '23 at 06:42
  • @Mike'Pomax'Kamermans Yes it can download a page and process the next page at the same time. It just can't process 2 pages at the same time, and even that is possible if part of the processing uses numpy or pandas or some other C-based library. – mousetail Aug 18 '23 at 06:43
  • Which means it can't do what is being asked, running 5-10 concurrent threads. – Mike 'Pomax' Kamermans Aug 18 '23 at 06:44
  • @Mike'Pomax'Kamermans Of course it can, you can easily spend 90% of your time downloading, which makes 10 threads a perfectly reasonable way to increase performance – mousetail Aug 18 '23 at 06:45
  • Then feel free to post an answer, but that's still using the wrong tool if you want an efficient bulk scraper. People already wrote a bunch of those, find one (or more) and just start using them. – Mike 'Pomax' Kamermans Aug 18 '23 at 06:46
  • To address the actual question, you want the [ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor), which is part of the standard library and automatically divides tasks over a fixed number of threads (a plain-threads version of the same pattern is sketched after these comments). – mousetail Aug 18 '23 at 06:49
  • @mousetail ThreadPoolExecutor works for me. Thanks! – Mayuri Aug 18 '23 at 07:02
  • @mousetail I tried your solution and it works for me, but after some time the system hung and the process was stopped by the system. – Mayuri Aug 25 '23 at 06:16
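The fixed pool that the comments describe can also be built with plain threads and a queue, staying close to the question's original code. A rough sketch, assuming `hitandsave` is the method from the question (`scrape_with_workers` is just an illustrative name):

    import queue
    import threading

    def scrape_with_workers(self, urls, num_workers=5):
        # All work goes into a shared queue; num_workers threads drain it.
        tasks = queue.Queue()
        for count, url in enumerate(urls, start=1):
            tasks.put((url, count))

        def worker():
            while True:
                try:
                    url, count = tasks.get_nowait()
                except queue.Empty:
                    return  # queue drained, thread exits
                self.hitandsave(url, count)

        workers = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in workers:
            t.start()
        for t in workers:
            t.join()  # block until every URL has been processed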

0 Answers