Python3 SQLAlchemy delete duplicates

Question

I'm using SQLAlchemy to manage a database and I'm trying to delete all rows that contain duplicates. The table has an id (primary key) and domain name.

Example:
ID| Domain
1 | example-1.com
2 | example-2.com
3 | example-1.com

In this case I want to delete 1 instance of example-1.com. Sometimes I will need to delete more than 1 but in general the database should not have a domain more than once and if it does, only the first row should be kept and the others should be deleted.

score 4 · Accepted Answer · answered Sep 06 '20 at 12:38

Assuming your model looks something like this:

import sqlalchemy as sa

class Domain(Base):
    __tablename__ = 'domain_names'

    id = sa.Column(sa.Integer, primary_key=True)
    domain = sa.Column(sa.String)

Then you can delete the duplicates like this:

# Create a query that identifies the row for each domain with the lowest id
inner_q = session.query(sa.func.min(Domain.id)).group_by(Domain.domain)
aliased = sa.alias(inner_q)
# Select the rows that do not match the subquery
q = session.query(Domain).filter(~Domain.id.in_(aliased))

# Delete the unmatched rows (SQLAlchemy generates a single DELETE statement from this loop)
for domain in q:
    session.delete(domain)
session.commit()

# Show remaining rows
for domain in session.query(Domain):
    print(domain)
print()

If you are not using the ORM, the core equivalent is:

meta = sa.MetaData()
domains = sa.Table('domain_names', meta, autoload=True, autoload_with=engine)

inner_q = sa.select([sa.func.min(domains.c.id)]).group_by(domains.c.domain)
aliased = sa.alias(inner_q)

with engine.connect() as conn:
    conn.execute(domains.delete().where(~domains.c.id.in_(aliased)))

This answer is based on the SQL provided in this answer. There are other ways of deleting duplicates, which you can see in the other answers on the link, or by googling "sql delete duplicates" or similar.

Python3 SQLAlchemy delete duplicates

1 Answers1