I'm trying to process an entire CSV file as quickly as possible, so I'm processing each line in parallel as a Celery task. The cleanup step, which is also a Celery task, has to wait until every line has been processed. See the example below.
The problem is that I can't get through a file because I keep hitting MySQL connection errors. So far I've seen these two:

(2013, 'Lost connection to MySQL server during query')
(2006, 'MySQL server has gone away')
from app.db.meta import Session
from celery import chord, Celery
from celery.signals import task_postrun

celery = Celery()
celery.config_from_object('config')


@task_postrun.connect
def close_session(*args, **kwargs):
    # return the scoped session to the pool after every task
    Session.remove()


def main():
    # process each line in parallel
    header = [process_line.s(line) for line in csv_file]

    # pass stats to cleanup after all lines are processed
    callback = cleanup.s()
    chord(header)(callback)


@celery.task
def process_line(line):
    session = Session()
    ...
    # process line
    ...
    return stats


@celery.task
def cleanup(stats):
    session = Session()
    ...
    # do cleanup and log stats
    ...
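For context, the Session in app.db.meta is a scoped session bound to a pooled engine, which is why the task_postrun handler calls Session.remove(). A minimal sketch of that module (simplified; the connection URL and pool options below are placeholders, not my exact settings):

# app/db/meta.py -- simplified sketch; URL and pool options are placeholders
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

# SQLAlchemy's default QueuePool; pool_recycle discards connections older
# than the given number of seconds instead of handing them back out
engine = create_engine(
    'mysql://root@localhost/ab__development',
    pool_recycle=3600,
)

# scoped_session gives each worker thread its own session; Session.remove()
# in the task_postrun handler returns that session's connection to the pool
Session = scoped_session(sessionmaker(bind=engine))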
I'm using Celery 3.1.18 and SQLAlchemy 0.9.9, with connection pooling enabled. Here's the MySQL process list while the job runs:
mysql> SHOW FULL PROCESSLIST;
+----+------+-----------+-----------------+---------+------+-------+-----------------------+
| Id | User | Host      | db              | Command | Time | State | Info                  |
+----+------+-----------+-----------------+---------+------+-------+-----------------------+
|  1 | root | localhost | ab__development | Sleep   | 4987 |       | NULL                  |
| 11 | root | localhost | ab__development | Sleep   | 1936 |       | NULL                  |
| 16 | root | localhost | ab__development | Sleep   |  143 |       | NULL                  |
| 17 | root | localhost | ab__development | Sleep   | 1045 |       | NULL                  |
| 18 | root | localhost | NULL            | Query   |    0 | init  | SHOW FULL PROCESSLIST |
| 21 | root | localhost | ab__development | Sleep   |    7 |       | NULL                  |
+----+------+-----------+-----------------+---------+------+-------+-----------------------+
6 rows in set (0.01 sec)