
I have a web application written in Flask. As everyone suggests, I can't use Flask's built-in server in production, so I thought of using Gunicorn with Flask.

In the Flask application I load some machine learning models, which are about 8 GB collectively. Concurrency of my web application can go up to 1000 requests, and the machine has 15 GB of RAM.
What is the best way to run this application?


2 Answers


You can start your app with multiple workers or async workers with Gunicorn.

Flask server.py:

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()

Gunicorn with the gevent async worker:

gunicorn server:app -k gevent --worker-connections 1000

Gunicorn with 1 worker and 12 threads:

gunicorn server:app -w 1 --threads 12

Gunicorn with 4 workers (multiprocessing):

gunicorn server:app -w 4
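
As a sizing hint: the Gunicorn docs suggest roughly (2 x num_cores) + 1 workers as a starting point. A minimal config-file sketch of that rule; the file name gunicorn.conf.py is just the usual convention:

import multiprocessing

# gunicorn.conf.py -- rule-of-thumb worker count; tune for your workload and RAM
workers = multiprocessing.cpu_count() * 2 + 1

Run it with gunicorn -c gunicorn.conf.py server:app. Note that with 8 GB of models loaded per worker this formula would exhaust 15 GB of RAM immediately, which is why the comments below move to a single worker with threads or an async worker.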

More information on Flask concurrency in this post: How many concurrent requests does a single Flask process receive?

  • With multiple workers it throws an out-of-memory exception, since the models are so large. I think each worker loads all the models into its own memory space. – neel Mar 07 '16 at 09:02
  • You need to use an async worker like gevent to allow concurrency with one worker: `gunicorn -k gevent --worker-connections 1000`. – molivier Mar 07 '16 at 09:10
  • You can also add `--threads` to run each worker with the specified number of threads. See Edit. – molivier Mar 07 '16 at 09:29
  • Which worker type should I use if my API call takes around 1 sec? – neel Mar 09 '16 at 10:45
  • I'd go with gevent and monkey patch your app (see the sketch after these comments): http://stackoverflow.com/questions/29527351/async-worker-on-gunicorn-seems-blocking. You can also have a look at Celery to run background tasks. – molivier Mar 09 '16 at 11:01
  • @neel did you get any solution? I have the same problem. – Yasar Arafath Sep 06 '21 at 07:07
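
Following up on the monkey-patching comment above, a minimal sketch of what that could look like at the top of server.py (assuming your blocking calls go through the standard library, which is what gevent can patch):

from gevent import monkey
monkey.patch_all()  # must run before anything else imports socket, ssl, etc.

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    # blocking I/O here (outbound HTTP, DB queries) now yields to other greenlets
    return "Hello World!"

Then start it as before with gunicorn server:app -k gevent --worker-connections 1000. Gunicorn's gevent worker applies monkey patching itself, so the explicit call is mainly insurance that it happens before your own imports.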

The best thing to do is to use pre-fork mode (`preload_app=True`). This initializes your code in a "master" process and then simply forks off worker processes to handle requests. If you are running on Linux, and assuming your model is read-only, the OS is smart enough to share the physical memory amongst all the processes (copy-on-write).
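
A minimal sketch of that setup as a config file (the file name gunicorn.conf.py and the worker count are illustrative):

# gunicorn.conf.py
preload_app = True   # import server.py (and load the 8 GB of models) once, in the master
workers = 4          # forked workers then share the read-only model pages via copy-on-write
bind = "0.0.0.0:8000"

Start it with gunicorn -c gunicorn.conf.py server:app, or equivalently pass --preload on the command line.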

  • The problem I found with this is that if you use database connections, they will complain that the client shouldn't be forked. In my case, MongoClient from PyMongo: `/usr/local/lib/python3.8/site-packages/pymongo/topology.py:164: UserWarning: MongoClient opened before fork. Create MongoClient only after forking.` – carkod Nov 16 '21 at 08:29
  • Yes, in those cases you need to be sure to initialize those connections post-fork (see the post_fork sketch below); after that it should be fine. – slushi Nov 16 '21 at 19:50
  • I confirm: such memory reuse on Linux systems (such as our Ubuntu containers) is evident also for multiple `optuna` scripts (training rather memory-intensive ML models) executed in parallel using multiprocessing. – mirekphd Dec 11 '22 at 10:38
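
Regarding the fork-safety comments above: Gunicorn's post_fork server hook runs in every worker right after it is forked, which is a natural place to create such connections. A sketch, where myapp.db and init_db are hypothetical names:

# in gunicorn.conf.py -- runs once in each worker, after forking
def post_fork(server, worker):
    # build per-worker clients (e.g. MongoClient) here rather than at module
    # import time, so they are never created in the master and then forked
    from myapp.db import init_db   # hypothetical module and helper
    init_db()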