
I have a very basic sitemap scraper built in Python 3 using requests and lxml. The aim is to build a database of the URLs of a certain website. Currently it works as follows: for each top-level sitemap to be scraped, I trigger a Celery task. In this task, the sitemap is parsed to check whether it's a sitemapindex or a urlset. Sitemapindexes point to other sitemaps hierarchically, whereas urlsets point to end URLs - they're like the leaves in the tree.

If the sitemap is identified as a sitemapindex, each URL it contains, which points to a sub-sitemap, is processed in a separate thread, repeating the process from the beginning.

If the sitemap is identified as a urlset, the URLs within are stored in the database and this branch finishes.
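The branching logic described above can be sketched roughly like this (the function names and the `fetch`/`store` callbacks are placeholders, and the stdlib `ElementTree` stands in for lxml; the real code would also dispatch each branch to a Celery task or thread):

```python
# Sketch of the sitemap branching: classify the root element,
# recurse on sitemapindexes, store the URLs of a urlset.
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def process_sitemap(url, fetch, store):
    """fetch(url) -> XML bytes; store(list_of_urls) persists leaf URLs."""
    root = ET.fromstring(fetch(url))
    kind = root.tag.split("}")[-1]  # strip the XML namespace prefix
    if kind == "sitemapindex":
        # Each <loc> points to a sub-sitemap: recurse on it
        # (the real scraper does this in a separate thread).
        for loc in root.iter(SM + "loc"):
            process_sitemap(loc.text, fetch, store)
    elif kind == "urlset":
        # Leaf of the tree: hand the end URLs to the database layer.
        store([loc.text for loc in root.iter(SM + "loc")])
```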

I've been reading about coroutines, asyncio, gevent, async/await, etc., and I'm not sure whether my problem is a good fit for these technologies or whether performance would be improved.

As far as I've read, coroutines are useful when dealing with I/O operations in order to avoid blocking execution while the I/O operation is running. However, I've also read that they're inherently single-threaded, so I understand there's no parallelization when, e.g., the code starts parsing the XML response from the I/O operation.
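To illustrate that last point: under asyncio, CPU-bound work such as an XML parse does run on the event loop's single thread unless it is explicitly offloaded, e.g. with `run_in_executor` (a minimal sketch with a stand-in `parse` function, not real lxml parsing):

```python
import asyncio

def parse(xml_text):
    # Stand-in for a CPU-bound parse (e.g. lxml on a large sitemap).
    return xml_text.upper()

async def handle(xml_text):
    loop = asyncio.get_running_loop()
    # Offload the blocking parse to the default thread pool so the
    # event loop stays free to service other coroutines meanwhile.
    return await loop.run_in_executor(None, parse, xml_text)

result = asyncio.run(handle("<urlset/>"))
```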

So essentially the questions are: how could I implement this using coroutines/asyncio/insert_similar_technology, and would I benefit from it performance-wise?

Edit: by the way, I know Scrapy has a specialized SitemapSpider, just in case anyone suggests using it.

José Tomás Tocino

1 Answer


Sorry, I'm not sure I fully understand how your code works, but here are some thoughts:

Does your program download multiple URLs?

If yes, asyncio can be used to reduce the time your program spends waiting for network I/O. If not, asyncio won't help you.

How does your program download the URLs?

If one-by-one, then asyncio can help you grab them much faster. On the other hand, if you're already grabbing them in parallel (with different threads, for example), you won't get much benefit from asyncio.
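The difference can be shown with a small sketch where `asyncio.sleep` stands in for network latency (no real HTTP, so no aiohttp dependency): ten sequential fetches would wait ~1 s in total, while `gather` overlaps the waits on a single thread.

```python
import asyncio
import time

async def fetch(url):
    await asyncio.sleep(0.1)  # pretend this is one network round-trip
    return url

async def grab_all(urls):
    # gather() runs the coroutines concurrently: total wait is about
    # one round-trip, not one round-trip per URL.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = ["http://example.com/%d" % i for i in range(10)]
start = time.monotonic()
results = asyncio.run(grab_all(urls))
elapsed = time.monotonic() - start  # ~0.1 s rather than ~1.0 s
```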

I advise you to read my answer about asyncio here. It's short and can help you understand why and when to use asynchronous code.

Mikhail Gerasimov