I'm trying to decide whether I should use gevent or threading to implement concurrency for web scraping in Python.

My program should be able to support a large (~1000) number of concurrent workers. Most of the time, the workers will be waiting for requests to come back.

Some guiding questions: What exactly is the difference between a thread and a greenlet? What is the maximum number of threads / greenlets I should create in a single process (given the spec of the server)?

  • Too broad? If you have multiple questions, ask multiple questions. – user202729 Feb 04 '18 at 14:03
  • You should also consider asyncio, I think. To answer your questions: greenlets are typically lighter weight than normal threads, which means there is less overhead in creating many of them and switching between them. However, as a consequence they share more between them, which, in certain cases, can be a problem. You really need to look into a longer tutorial on the matter to get a better idea. – JohanL Feb 04 '18 at 16:18
  • When it comes to pure IO, gevent is definitely a better option than threads. See my comment on this [answer](https://stackoverflow.com/a/51932442/2089675). – smac89 Sep 13 '22 at 03:52

2 Answers

A Python thread is an OS thread: the OS creates and schedules it, which makes it a lot heavier, since every switch requires a kernel context switch. Green threads (greenlets) are lightweight; they live in userspace, so the OS neither creates nor manages them.
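For comparison, here is a minimal sketch of the OS-thread approach using only the standard library. The names `fake_fetch` and `crawl_with_threads` are hypothetical, and `time.sleep` stands in for a real network request; the worker count is an arbitrary assumption, not a recommended limit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    # Hypothetical stand-in for a network request. A blocking wait like
    # this releases the GIL, so other threads can run in the meantime.
    time.sleep(delay)
    return url, 200


def crawl_with_threads(urls, max_workers=100):
    # Each worker is a real OS thread, created and scheduled by the kernel.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the order of the input URLs.
        return list(executor.map(fake_fetch, urls))


thread_urls = [f"https://example.com/page/{i}" for i in range(200)]
thread_results = crawl_with_threads(thread_urls)
```

Every entry in `thread_results` is a `(url, status)` tuple in the same order as the input list.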

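I think you can use gevent. Gevent = event loop (libev) + coroutines (greenlet) + monkey patching. Gevent gives you thread-like concurrency without using OS threads, so you can write normal-looking synchronous code and still get async I/O.

Make sure you don't have CPU-bound work in your code: greenlets are scheduled cooperatively, so a long computation blocks every other greenlet until it yields.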
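As a rough illustration of the gevent approach, the sketch below runs 1000 simulated requests through a `gevent.pool.Pool`. The names `fake_fetch_green` and `crawl_with_greenlets` are hypothetical, and `gevent.sleep` stands in for real network I/O; in a real scraper you would typically call `gevent.monkey.patch_all()` before importing your HTTP library so that its blocking socket calls yield to other greenlets.

```python
import time

import gevent
from gevent.pool import Pool


def fake_fetch_green(url, delay=0.1):
    # Hypothetical stand-in for a network request. gevent.sleep yields
    # control to the event loop so other greenlets can run while we wait.
    gevent.sleep(delay)
    return url, 200


def crawl_with_greenlets(urls, concurrency=1000):
    # The pool caps how many greenlets run at once; all of them share
    # a single OS thread.
    pool = Pool(concurrency)
    # Pool.map blocks until every greenlet finishes and keeps input order.
    return pool.map(fake_fetch_green, urls)


green_urls = [f"https://example.com/page/{i}" for i in range(1000)]
start = time.perf_counter()
green_results = crawl_with_greenlets(green_urls)
elapsed = time.perf_counter() - start
```

Because the waits overlap, `elapsed` should be close to a single request's delay rather than 1000 times that delay. Replacing `gevent.sleep` with a CPU-heavy loop would remove that benefit, since nothing would yield.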

smac89
panda912

I don't think you have thought this whole thing through. I have built several sizable lightweight-thread apps with greenlets created through the Gevent framework. As long as you let control switch between greenlets with appropriate sleeps or switches, everything tends to work fine. Rather than blocking indefinitely while waiting for a reply, it is better to give the wait a timeout, catch the timeout exception, sleep in the except block, and then loop again; otherwise the greenlets will not switch readily.

Also, take care to join and/or kill all greenlets, since you could otherwise end up with zombie greenlets that cause many unwanted side effects.
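One way to do that cleanup can be sketched as follows: join with a timeout, then kill whatever is still running. The `worker` function and the delays are hypothetical; the third greenlet is deliberately slow so that it outlives the join timeout.

```python
import gevent


def worker(delay):
    # Hypothetical task: wait cooperatively, then return the delay.
    gevent.sleep(delay)
    return delay


# Two fast greenlets and one that will not finish in time.
greenlets = [gevent.spawn(worker, d) for d in (0.01, 0.02, 5)]

# Wait at most 0.5 seconds; joinall returns the greenlets that finished.
done = gevent.joinall(greenlets, timeout=0.5)

# Anything not ready by now is a straggler; kill it so it cannot linger.
stragglers = [g for g in greenlets if not g.ready()]
gevent.killall(stragglers)

finished_values = sorted(g.value for g in done)
```

After `killall` returns, every greenlet in the list is dead, so none is left running in the background.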

However, I would not recommend this for your application. Rather, I would use one of the following WebSocket extensions built on Gevent. See this link

Websockets in Flask

and this link

https://www.shanelynn.ie/asynchronous-updates-to-a-webpage-with-flask-and-socket-io/

I have implemented a very nice app with Flask-SocketIO

https://flask-socketio.readthedocs.io/en/latest/

It runs through Gunicorn with Nginx very nicely from a Docker container. SocketIO also interfaces very nicely with JavaScript on the client side.

(Be careful with the web scraping: use something like Scrapy with the appropriate ethical scraping settings enabled.)

barnwaldo