
I am using Python multiprocessing to scrape millions of records, spawning 30-50 processes at once that each execute a certain method. I have found this is not resource-friendly: spawning that many Python processes at once is expensive. What efficient, resource-friendly alternatives, other than threading, do modern Python versions provide?

  • You can look at the `asyncio` module. – Andrej Kesely Aug 16 '22 at 19:06
  • @AndrejKesely Is it good for scraping millions of records and for I/O operations like storing to a database? – Volatil3 Aug 16 '22 at 19:08
  • I don't know the specifics, but `asyncio` (`aiohttp` especially) is more resource-friendly - it doesn't spawn a new process with its own Python interpreter... – Andrej Kesely Aug 16 '22 at 19:10
  • What's wrong with threading, btw? There aren't _that_ many [different concurrency model options](https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio-in-python-3) (for running on a single computer) – Kache Aug 16 '22 at 19:10
  • Does this answer your question? [multiprocessing vs multithreading vs asyncio in Python 3](https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio-in-python-3) – Kache Aug 16 '22 at 19:12
  • @AndrejKesely Have you ever used aiohttp for hundreds of thousands of URLs? If so, what was the RAM usage when loading the URL list? – Barry the Platipus Aug 16 '22 at 19:37
  • @platipus_on_fire I've never used it at that scale, but I imagine the RAM usage won't be big if you don't store anything in memory: e.g. don't load the URLs into a list, but read them lazily (from a DB, for example) and use a work queue. Also, don't keep the scraping results in memory; again, store them in the DB as soon as possible (see the sketch below). – Andrej Kesely Aug 16 '22 at 19:41
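A minimal sketch of the pattern described in the comments, assuming `aiohttp` for the HTTP calls: a fixed pool of `asyncio` workers pulls URLs from a bounded queue, so neither the URL list nor the results ever sit in memory all at once. The helper `save_record` and the `CONCURRENCY` value are hypothetical placeholders, not from the original discussion.

```python
import asyncio

import aiohttp

# Hypothetical tuning knob, not from the discussion above.
CONCURRENCY = 50  # simultaneous requests; far cheaper than 50 processes


def save_record(url: str, body: str) -> None:
    # Hypothetical placeholder: in real code, write to your database
    # (ideally via an async driver so the event loop isn't blocked).
    print(url, len(body))


async def worker(queue: asyncio.Queue, session: aiohttp.ClientSession) -> None:
    # Each worker repeatedly takes a URL off the queue, fetches it,
    # and persists the result immediately instead of accumulating it.
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                body = await resp.text()
            save_record(url, body)
        except aiohttp.ClientError:
            pass  # log and/or retry in real code
        finally:
            queue.task_done()


async def main(urls) -> None:
    # Bounded queue: the producer blocks instead of loading millions
    # of URLs into memory at once.
    queue: asyncio.Queue = asyncio.Queue(maxsize=CONCURRENCY * 2)
    async with aiohttp.ClientSession() as session:
        workers = [
            asyncio.create_task(worker(queue, session))
            for _ in range(CONCURRENCY)
        ]
        # `urls` can be any lazy iterable, e.g. rows streamed from a DB cursor.
        for url in urls:
            await queue.put(url)
        await queue.join()  # wait until every queued URL is processed
        for w in workers:
            w.cancel()  # workers loop forever; shut them down


if __name__ == "__main__":
    asyncio.run(main(["https://example.com"]))
```

The key design choice is that everything runs in a single process and a single interpreter: concurrency comes from the event loop switching between tasks while they wait on the network, not from per-process memory and startup overhead.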

0 Answers