
I am using Python multiprocessing to scrape millions of records, spawning 30-50 processes at once that each execute a certain method. I have found this is not resource-friendly: spawning that many Python processes at once is expensive. What efficient, resource-friendly alternatives, other than threading, do modern Python versions provide?

  • You can look at the `asyncio` module. – Andrej Kesely Aug 16 '22 at 19:06
  • @AndrejKesely Is it good for scraping millions of records and for I/O operations like storing to a database? – Volatil3 Aug 16 '22 at 19:08
  • I don't know the specifics, but `asyncio` (`aiohttp` especially) is more resource-friendly - it doesn't spawn a new process with its own Python interpreter... – Andrej Kesely Aug 16 '22 at 19:10
  • What's wrong with threading, btw? There aren't _that_ many [different concurrency model options](https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio-in-python-3) (for running on a single computer) – Kache Aug 16 '22 at 19:10
  • Does this answer your question? [multiprocessing vs multithreading vs asyncio in Python 3](https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio-in-python-3) – Kache Aug 16 '22 at 19:12
  • @AndrejKesely Have you ever used aiohttp for hundreds of thousands of URLs? If so, what was the RAM usage when loading the URL list? – Barry the Platipus Aug 16 '22 at 19:37
  • @platipus_on_fire I've never used it at that scale, but I imagine the RAM usage won't be big if you don't store anything in memory: e.g. don't load the URLs into a list, but read them lazily (from a DB, for example) and use a work queue. Also, don't keep the scraping results in memory; again, store them in the DB as soon as possible (see the sketch below). – Andrej Kesely Aug 16 '22 at 19:41
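A minimal sketch of the pattern described in the comments, assuming `aiohttp` for the HTTP calls: a fixed pool of `asyncio` workers pulls URLs from a bounded queue, so neither the URL list nor the results ever sit in memory all at once. The helper `save_record` and the `CONCURRENCY` value are hypothetical placeholders, not from the original discussion.

```python
import asyncio

import aiohttp

# Hypothetical tuning knob, not from the discussion above.
CONCURRENCY = 50  # simultaneous requests; far cheaper than 50 processes


def save_record(url: str, body: str) -> None:
    # Hypothetical placeholder: in real code, write to your database
    # (ideally via an async driver so the event loop isn't blocked).
    print(url, len(body))


async def worker(queue: asyncio.Queue, session: aiohttp.ClientSession) -> None:
    # Each worker repeatedly takes a URL off the queue, fetches it,
    # and persists the result immediately instead of accumulating it.
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                body = await resp.text()
            save_record(url, body)
        except aiohttp.ClientError:
            pass  # log and/or retry in real code
        finally:
            queue.task_done()


async def main(urls) -> None:
    # Bounded queue: the producer blocks instead of loading millions
    # of URLs into memory at once.
    queue: asyncio.Queue = asyncio.Queue(maxsize=CONCURRENCY * 2)
    async with aiohttp.ClientSession() as session:
        workers = [
            asyncio.create_task(worker(queue, session))
            for _ in range(CONCURRENCY)
        ]
        # `urls` can be any lazy iterable, e.g. rows streamed from a DB cursor.
        for url in urls:
            await queue.put(url)
        await queue.join()  # wait until every queued URL is processed
        for w in workers:
            w.cancel()  # workers loop forever; shut them down


if __name__ == "__main__":
    asyncio.run(main(["https://example.com"]))
```

The key design choice is that everything runs in a single process and a single interpreter: concurrency comes from the event loop switching between tasks while they wait on the network, not from per-process memory and startup overhead.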

0 Answers