1

I am using sqlalchemy to write to a mysql db where I am indexing some files and storing their contents. I need to write the files and then write the index entries which have a foreign key to the files table. However, sqlalchemy seems to be issuing the INSERT statements out of order.

Here is a minimal functional example illustrating the problem using mock random data (less the configuration file, which contains server specific info):

Index/ORM.py:

#!/bin/env python2.7

from __future__ import print_function

import os

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.dialects.mysql import LONGBLOB, INTEGER
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker
from sqlalchemy import create_engine

from Index import load_cfg

class Base(object):
    """
    Basic MySQL table settings
    """
    __table_args__ = {
            'mysql_engine': 'InnoDB',
            'mysql_collate': 'latin1_general_cs'
            }

Base = declarative_base(cls=Base)

class CoverageIndex(Base):
    """
    Class for coverage_index table objects
    """
    __tablename__ = 'coverage_index'

    filename = Column(String(45), primary_key=True)
    #filename = Column(String(45), ForeignKey("files.filename"), primary_key=True)
    sequence_id = Column(String(45), primary_key=True, index=True)

    def __init__(self, filename, sequence_id):
        self.filename = filename
        self.sequence_id = sequence_id

class FileRow(Base):
    """
    Class for files stored in db
    """
    __tablename__ = 'files'

    filename = Column(String(45), primary_key=True)
    contents = Column(LONGBLOB)

    def __init__(self, filename, contents):
        self.filename = filename
        self.contents = contents

cfg = load_cfg()
db_string = 'mysql://%(user)s:%(passwd)s@%(host)s/%(db)s' % cfg['db_config']
engine = create_engine(db_string, echo=True)
Base.metadata.create_all(engine)

if __name__ == '__main__':
    pass

index.py:

#!/usr/bin/env python2.7

from __future__ import print_function

import os
import sys

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.exc import IntegrityError

from Index.ORM import Base, CoverageIndex, FileRow, engine as db_engine

if __name__ == '__main__':
    import string, random

    data = {}
    for i in range(0,10):
        file = 'file' + str(i)
        data[file] = {
                'seqs': ['seqa' + str(i), 'seqb' + str(i)],
                'contents': '\n'.join([''.join([random.choice(string.letters) for x in range (0, 80)]) for y in range (0, 2500)])}
    #print (data)

    Base.metadata.bind = db_engine

    DBSession = sessionmaker(bind=db_engine)
    session = DBSession()

    for file, datum in data.iteritems():
        file_query = session.query(FileRow).filter(FileRow.filename == file)
        if file_query.count() > 0:
            session.query(CoverageIndex).filter(CoverageIndex.filename == file).delete(synchronize_session='fetch')
            file_query.delete(synchronize_session='fetch')
        for i in datum['seqs']: 
            # Write to DB
            fqc = file_query.count() 
            print ("No. of files: " + str(fqc))
            if fqc == 0:
                print ("Adding: ")
                fr = FileRow(
                        filename = file,
                        contents = datum['contents']
                        )
                session.add(fr)
            cov = CoverageIndex(
                    filename = file, 
                    sequence_id = i) 
            session.add(cov)
        try:
            session.commit()
        except:
            #print ("SQL Commit Failed: %s" % file)
            session.rollback()
            session.close()
            raise
    session.close()

Here is a part of the output of one run. I would like to draw your attention to line 2018-03-13 16:05:40,291 and ...,292:

...
2018-03-13 16:05:40,287 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)                                                                    
2018-03-13 16:05:40,288 INFO sqlalchemy.engine.base.Engine SELECT count(*) AS count_1                                                          
FROM (SELECT files.filename AS files_filename, files.contents AS files_contents                                                                
FROM files                                                                                                                                     
WHERE files.filename = %s) AS anon_1                                                                                                           
2018-03-13 16:05:40,288 INFO sqlalchemy.engine.base.Engine ('file1',)                                                                          
2018-03-13 16:05:40,290 INFO sqlalchemy.engine.base.Engine SELECT count(*) AS count_1                                                          
FROM (SELECT files.filename AS files_filename, files.contents AS files_contents                                                                
FROM files                                                                                                                                     
WHERE files.filename = %s) AS anon_1                                                                                                           
2018-03-13 16:05:40,290 INFO sqlalchemy.engine.base.Engine ('file1',)                                                                          
No. of files: 0                                                                                                                                
Adding:                                                                                                                                        
2018-03-13 16:05:40,291 INFO sqlalchemy.engine.base.Engine INSERT INTO coverage_index (filename, sequence_id) VALUES (%s, %s)                  
2018-03-13 16:05:40,291 INFO sqlalchemy.engine.base.Engine ('file1', 'seqa1')                                                                  
2018-03-13 16:05:40,292 INFO sqlalchemy.engine.base.Engine INSERT INTO files (filename, contents) VALUES (%s, %s)                              
2018-03-13 16:05:40,292 INFO sqlalchemy.engine.base.Engine ('file1', 'BkTsRJTcNEigPFjofFxDmwVZDXRAsPECawRUjiFZTDGWWoLZzLnGlCwQQeAFyXhLqKjPAJmme
mFNfVzF\nJlZSvwGAdoImTnBAmcrSdMRDvxNYnnMfbQXdfuXulqufiIYpqjFUgfElZSrVkvBvPTg ... (204700 characters truncated) ... trwtYOycEOuDTVxsXeGoNYKAqHlE
LGPqcimwzwAFAEsCZGBBnGzYMHgabgnGZaGmQsn\nSNjYvBwSVdXVKbmJpKdSHSXCDKKvDlkyLxOxsEfOtmlCRruqzaiPhYRocKZQEJSVrtSHncFMBMTEpWUX')                    
2018-03-13 16:05:40,310 INFO sqlalchemy.engine.base.Engine SELECT count(*) AS count_1                                                          
FROM (SELECT files.filename AS files_filename, files.contents AS files_contents                                                                
FROM files                                                                                                                                     
WHERE files.filename = %s) AS anon_1                                                                                                           
2018-03-13 16:05:40,310 INFO sqlalchemy.engine.base.Engine ('file1',)                                                                          
No. of files: 1                                                                                                                                
2018-03-13 16:05:40,311 INFO sqlalchemy.engine.base.Engine INSERT INTO coverage_index (filename, sequence_id) VALUES (%s, %s)                  
2018-03-13 16:05:40,311 INFO sqlalchemy.engine.base.Engine ('file1', 'seqb1')                                                                  
2018-03-13 16:05:40,312 INFO sqlalchemy.engine.base.Engine COMMIT       
...

Here, you can see that sqlalchemy is inserting the coverage_index before inserting the files object. I presume that this is because the files object is much bigger and taking some time to prepare, and so the engine decides to asynchronously run the later INSERT first.

However, the files entry needs to be inserted first, because filename in coverage_index should be a foreign key to files. (It throws an exception if I do this with the foreign key constraint defined)

I know I can commit after adding to files, but I would prefer that the files and coverage_index INSERT be in the same transaction, so they stay in sync.

So the question is, is there a way to force sqlalchemy to execute synchronously within a transaction?

ipetrik
  • 1,749
  • 18
  • 28

1 Answers1

0

Not sure if this is the best way, but it seems to achieve what I was going for:

flush(objects=None)

Flush all the object changes to the database.

Writes out all pending object creations, deletions and modifications to the database as INSERTs, DELETEs, UPDATEs, etc. Operations are automatically ordered by the Session’s unit of work dependency solver.

Database operations will be issued in the current transactional context and do not affect the state of the transaction, unless an error occurs, in which case the entire transaction is rolled back. You may flush() as often as you like within a transaction to move changes from Python to the database’s transaction buffer

Thanks to:

Is SQLAlchemy saves order in adding objects to session?

http://www.aosabook.org/en/sqlalchemy.html - Section 20.9 Unit of Work

ipetrik
  • 1,749
  • 18
  • 28