
I am having a problem similar to the one in this question. I have a Flask application that takes input from the user (sometimes several thousand addresses), processes it (cleans/geocodes the addresses), and then returns a results page once everything is done. During this time the page stays loading, potentially for up to 15 minutes depending on the size of the input, since the application can process roughly 300 addresses per minute.

I saw one of the answers say that it could potentially be solved by putting all of the work in a separate process, redirecting the user to a 'Loading, Please Wait' page, and then, once the work is complete, redirecting them to the results page.

I was wondering what all of this would entail.

Here is a simplified version of my code, excluding import statements etc. (I am currently using gunicorn to serve the application):

app = Flask(__name__)

@app.route("/app")
def index():
    return """
          <form action="/clean" method="POST"><textarea rows="25" cols="100"
          name="addresses"></textarea>
          <p><input type="submit"></p>
          </form></center></body>"""

@app.route("/clean", methods=['POST'])
def dothing():
    addresses = request.form['addresses']
    return cleanAddress(addresses)

def cleanAddress(addresses):
    # ... processes each address one by one,
    # ... then appends it to a list to return to the user
    return "\n".join(cleaned)  # cleaned is the list containing the output

I was told that Celery could be used to help me do this.

Here is my current attempt with Celery. I am still getting the same problem where the page times out, but, as before, I can see from the console that the application is still working...

app = Flask(__name__)

app.config['CELERY_BROKER_URL'] = 'redis://0.0.0.0:5000'
app.config['CELERY_RESULT_BACKEND'] = 'redis://0.0.0.0:5000'

celery = Celery(app.name, broker = app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)

@app.route("/clean", methods=['POST'])
def dothing():
    addresses = request.form['addresses']
    return cleanAddress(addresses)

@celery.task
def cleanAddress(addresses):
    # ... processes each address one by one,
    # ... then appends it to a list to return to the user
    return "\n".join(cleaned)  # cleaned is the list containing the output

After the application finishes running, I am given this console error:

Traceback (most recent call last):
  File "/home/my name/anaconda/lib/python2.7/SocketServer.py", line 596, in process_request_thread
    self.finish_request(request, client_address)
  File "/home/my name/anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/home/my name/anaconda/lib/python2.7/SocketServer.py", line 654, in __init__
    self.finish()
  File "/home/my name/anaconda/lib/python2.7/SocketServer.py", line 713, in finish
    self.wfile.close()
  File "/home/my name/anaconda/lib/python2.7/socket.py", line 283, in close
    self.flush()
  File "/home/my name/anaconda/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
Harrison

2 Answers


You're not running the task in the background. Calling it directly executes it synchronously; use delay or apply_async to run it in the background.

task = cleanAddress.delay(addresses)
return jsonify(task.id)
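Dropped into the /clean view from the question, that might look roughly like this (a sketch; the task_id key in the JSON response is just an illustration):

@app.route("/clean", methods=['POST'])
def dothing():
    addresses = request.form['addresses']
    # Queue the work instead of running it inside the request;
    # delay() returns immediately with an AsyncResult.
    task = cleanAddress.delay(addresses)
    # Hand the task id back to the client so it can poll for the result.
    return jsonify(task_id=task.id)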

Respond with the task's id, then poll its state with a separate view to determine whether the results are ready.

from celery.states import SUCCESS
task = cleanAddress.AsyncResult(id)
return jsonify(task.state == SUCCESS)
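Wired into a separate Flask view, the polling endpoint could look something like this (a sketch; the route path and returning the result inline are assumptions, not part of the snippet above):

from celery.states import SUCCESS

@app.route("/status/<task_id>")
def taskstatus(task_id):
    task = cleanAddress.AsyncResult(task_id)
    if task.state == SUCCESS:
        # task.result holds the string returned by cleanAddress
        return jsonify(ready=True, result=task.result)
    return jsonify(ready=False)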

Storing the state (and the results) requires both the broker and results backends to be configured. By default, there is no results backend configured, so all state is discarded.
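For example, assuming a Redis server running locally on its default port (6379), both settings could be wired in like this (a sketch, not the questioner's exact setup):

app = Flask(__name__)

# Point both the broker and the result backend at Redis.
app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'
app.config['CELERY_RESULT_BACKEND'] = 'redis://localhost:6379/0'

celery = Celery(app.name,
                broker=app.config['CELERY_BROKER_URL'],
                backend=app.config['CELERY_RESULT_BACKEND'])
celery.conf.update(app.config)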

davidism

Using Celery on its own leaves out a bit: it lets you kick off a background process, but it doesn't by itself solve your problem of things timing out. My recommended solution is to use something like Redis or your database to store the results. When someone visits the "kick off this job" URL, they get back a message like "starting the process, please check '/results' in 15 minutes or so", and a flag in the db is set to "in work".

The Celery task kicks off and stores the results in the database somewhere. When it finishes, it sets the flag to "finished".

When someone goes to /results they get the results from the db if the flag is set to "finished" or a message saying "still working" if the flag is "in work".
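A rough sketch of that pattern, reusing the app and celery objects configured above and using Redis as the store (the key names, the single-job layout, and the trivial cleaning step are placeholders for illustration, not a complete implementation):

import redis

# Redis doubles as the results store here, but any database would do.
store = redis.StrictRedis(host='localhost', port=6379, db=1,
                          decode_responses=True)

@app.route("/clean", methods=['POST'])
def start_job():
    addresses = request.form['addresses']
    store.set('job:status', 'in work')        # mark the job as started
    cleanAddress.delay(addresses)             # the Celery worker does the real work
    return "Starting the process, please check /results in 15 minutes or so."

@celery.task
def cleanAddress(addresses):
    # Placeholder for the real per-address cleaning/geocoding.
    cleaned = [line.strip() for line in addresses.splitlines()]
    store.set('job:result', "\n".join(cleaned))
    store.set('job:status', 'finished')       # flip the flag when done

@app.route("/results")
def results():
    if store.get('job:status') == 'finished':
        return store.get('job:result')
    return "Still working..."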

Paul Becotte
  • Is there any other way to prevent the timing out, other than the way that you've described? – Harrison Aug 15 '16 at 14:12
  • True, if you have a handle to the id of the task. If you want to set it up so you can wander away and come back later, you have to store that task id somewhere, whether in the HTML of the response page (that does the polling) or in a database entry or whatever. I didn't mean to say that Celery wasn't the solution, just that launching a Celery task isn't the whole answer; you also need to think about how to get the results of that task back to the user when they are ready. – Paul Becotte Aug 15 '16 at 15:17