As mentioned, I am using Python multiprocessing to scrape millions of records: I spawn 30-50 processes at once, each executing a certain method. What I found is that this is not resource-friendly; spawning that many Python processes at once is expensive. What efficient, resource-friendly alternatives, other than threading, do modern Python versions provide?
- You can look at the `asyncio` module. – Andrej Kesely Aug 16 '22 at 19:06
- @AndrejKesely Is it good for scraping millions of records and I/O operations like storing in a database? – Volatil3 Aug 16 '22 at 19:08
- I don't know the specifics, but `asyncio` (`aiohttp` especially) is more resource-friendly - it doesn't spawn a new process with its own Python interpreter... – Andrej Kesely Aug 16 '22 at 19:10
- What's wrong with threading, btw? There aren't _that_ many [different concurrency model options](https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio-in-python-3) (for running on a single computer). – Kache Aug 16 '22 at 19:10
- Does this answer your question? [multiprocessing vs multithreading vs asyncio in Python 3](https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio-in-python-3) – Kache Aug 16 '22 at 19:12
- @AndrejKesely Did you ever use aiohttp for hundreds of thousands of URLs? If so, what was the RAM usage when loading the URL list? – Barry the Platipus Aug 16 '22 at 19:37
- @platipus_on_fire I never used it at that scale, but I imagine the RAM usage won't be big if you don't store anything in memory: don't load the URLs into a list - read them lazily (from a DB, for example) and use a work queue. Likewise, don't keep the scraped results in memory - store them in the DB as soon as possible. – Andrej Kesely Aug 16 '22 at 19:41
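For illustration, here is a minimal sketch of the lazy-queue pattern described in the comments above (not the commenters' own code): a fixed pool of `asyncio` workers pulls URLs from a bounded `asyncio.Queue` and persists each result immediately, so neither the URL list nor the scraped data accumulates in memory. It assumes `aiohttp` is installed; `fetch_urls` and `save_result` are hypothetical placeholders for your own database reads and writes.

```python
import asyncio
import aiohttp

CONCURRENCY = 50  # each worker is a coroutine, not a separate process

def fetch_urls():
    # Hypothetical lazy source: yield URLs one at a time (e.g. from a DB
    # cursor) instead of loading millions of them into a list.
    for i in range(1000):
        yield f"https://example.com/record/{i}"

def save_result(url, body):
    # Hypothetical sink: insert into your database as soon as a page is
    # fetched, rather than collecting results in memory.
    pass

async def worker(queue, session):
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                body = await resp.text()
                save_result(url, body)
        except aiohttp.ClientError:
            pass  # log / retry in real code
        finally:
            queue.task_done()

async def main():
    # Bounded queue: put() blocks when full, giving backpressure so the
    # URL source never runs far ahead of the workers.
    queue = asyncio.Queue(maxsize=CONCURRENCY * 2)
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(queue, session))
                   for _ in range(CONCURRENCY)]
        for url in fetch_urls():
            await queue.put(url)
        await queue.join()  # wait until every queued URL is processed
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(main())
```

Because each worker is a coroutine rather than a process, there is no per-worker interpreter overhead, and memory use stays roughly constant no matter how many URLs flow through the queue.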