2

I have a table with a unique constraint across two columns, like:

CREATE TABLE entity (
    id      INT NOT NULL AUTO_INCREMENT,
    zip_code    INT NOT NULL,
    entity_url  VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY ix_uniq_zip_code_entity_url (zip_code, entity_url)
);

and a corresponding SQLAlchemy model. I am adding a lot of records and do not want to commit the session after each record. My assumption is that it is better to call session.add(new_record) multiple times and then session.commit() once.

But while adding new records I can get an IntegrityError because the constraint is violated. This is a normal situation, and I just want to skip inserting such records. But it looks like I can only roll back the entire transaction.

I also do not want to add complex checks like "get all records from the database where zip_code in [...] and entity_url in [...], then drop the matched data from records_to_insert".

Is there a way to instruct SQLAlchemy to drop records which violate the constraint?

Alex G.P.

2 Answers

5

My assumption is that it is better to call session.add(new_record) multiple times and then session.commit() once.

You might want to revisit this assumption. Batch processing of a lot of records usually lends itself to multiple commits -- what if you have 10k records and your code raises an exception on the 9,999th? You'll be forced to start over. The core question here is whether or not it makes sense for one of the records to exist in the database without the rest. If it does, then there's no problem committing on each entry (performance issues aside). In that case, you can simply catch the IntegrityError and call session.rollback() to continue down the list of records.
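For illustration, a minimal sketch of that commit-per-record approach, assuming session is your Session and records_to_insert holds the new model instances (both names are taken from the question):

from sqlalchemy.exc import IntegrityError

for record in records_to_insert:
    session.add(record)
    try:
        # Commit each record individually; a duplicate key raises IntegrityError.
        session.commit()
    except IntegrityError:
        # Roll back the failed insert and continue with the next record.
        session.rollback()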

In any case, a similar question was asked on the SQLA mailing list and answered by the library's creator, Mike Bayer. He recommended removing the duplicates from the list of new records yourself, as this is easy to do with a dictionary or a set. This could be as simple as a dict comprehension:

new_entities = {(entity['zip_code'], entity['entity_url']): entity for entity in new_entities}

(This would pick the last-seen duplicate as the one to add to the DB.)

Also note that he uses the SQLAlchemy core library to perform the inserts, rather than the ORM's session.add() method:

sess.execute(Entry.__table__.insert(), params=inserts)

This is a much faster option if you're dealing with a lot of records (like in his example, with 100,000 records).
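To sketch how the two pieces might fit together for the table in the question (this assumes Entity is the mapped class and new_entities is a list of dicts keyed by column name, neither of which appears in the original post):

# Deduplicate on the unique key, keeping the last-seen entry per pair.
deduped = {(e['zip_code'], e['entity_url']): e for e in new_entities}

# Bulk insert via the Core table object, bypassing the ORM unit of work.
session.execute(Entity.__table__.insert(), list(deduped.values()))
session.commit()

Note that this only guards against duplicates within the batch itself; rows that already exist in the database would still raise an IntegrityError.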

  • I suspected that there is no simple way to achieve the desired SQLAlchemy behaviour. About multiple `session.commit()` - if the table has complex indexes and a lot of data, it could be quite expensive to commit each row. – Alex G.P. Apr 16 '16 at 22:15
  • Yes, multiple commits will be expensive, and you can achieve greater flexibility in processing the duplicate elements by validating before you begin INSERTing. – Willem Yarbrough Apr 18 '16 at 04:35
1

If you decide to insert your records row by row, you can check if it already exists before you do your insert. This may or may not be more elegant and efficient:

from sqlalchemy import exists

def record_exists(session, some_id):
    return session.query(exists().where(YourEntity.id == some_id)).scalar()

for item in items:
    if not record_exists(session, item.some_id):
        session.add(item)
        session.flush()
    else:
        print("Already exists, skipping...")

session.commit()
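For the question's schema, where uniqueness is on the (zip_code, entity_url) pair rather than on id, the same check could be adapted along these lines (a sketch; Entity is assumed to be the mapped class for the entity table):

from sqlalchemy import and_, exists

def entity_exists(session, zip_code, entity_url):
    # True if a row with this (zip_code, entity_url) pair already exists.
    return session.query(
        exists().where(and_(Entity.zip_code == zip_code,
                            Entity.entity_url == entity_url))
    ).scalar()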
veuncent