I have a very basic sitemap scraper built in Python 3 using requests and lxml. The aim is to build a database of the URLs of a certain website. Currently it works as follows: for each top-level sitemap to be scraped, I trigger a Celery task. In this task, the sitemap is parsed to check whether it's a sitemapindex or a urlset. Sitemapindexes point to other sitemaps hierarchically, whereas urlsets point to end URLs - they're like the leaves of the tree.
If the sitemap is identified as a sitemapindex, each URL it contains (each pointing to a sub-sitemap) is processed in a separate thread, repeating the process from the beginning.
If the sitemap is identified as a urlset, the URLs within are stored in the database and this branch finishes.
I've been reading about coroutines, asyncio, gevent, async/await, etc., and I'm not sure whether my problem is a good fit for these technologies or whether performance would actually improve.
As far as I've read, coroutines are useful when dealing with IO operations, to avoid blocking execution while the IO operation is running. However, I've also read that they're inherently single-threaded, so I understand there's no parallelization when, for example, the code starts parsing the XML response from the IO operation.
So essentially the questions are: how could I implement this using coroutines/asyncio/insert_similar_technology, and would I benefit from it performance-wise?
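For reference, this is the rough, untested kind of structure I imagine an asyncio version would have (assuming aiohttp for the HTTP part, and pushing the lxml parsing to a thread pool so it doesn't block the event loop) - I'm not sure this is the right way to do it:

```python
# Untested sketch of what I imagine the asyncio version would look like.
# aiohttp and the executor usage are my assumptions, not what I have now.
import asyncio
import aiohttp
from lxml import etree

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

async def scrape_sitemap(session, url):
    async with session.get(url) as resp:
        body = await resp.read()
    # XML parsing is CPU-bound, so offload it to the default thread pool
    # instead of blocking the event loop.
    loop = asyncio.get_running_loop()
    root = await loop.run_in_executor(None, etree.fromstring, body)
    if etree.QName(root).localname == "sitemapindex":
        subs = root.xpath("//sm:sitemap/sm:loc/text()", namespaces=NS)
        # Fetch all sub-sitemaps concurrently on the same event loop.
        await asyncio.gather(*(scrape_sitemap(session, u) for u in subs))
    else:
        store_urls(root.xpath("//sm:url/sm:loc/text()", namespaces=NS))  # placeholder

async def main(top_level_sitemaps):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(scrape_sitemap(session, u) for u in top_level_sitemaps))

# asyncio.run(main(["https://example.com/sitemap.xml"]))
```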
Edit: by the way, I know Scrapy (built on Twisted) has a specialized SitemapSpider, just in case anyone suggests using it.