6

I want to select the count of all contacts with the same email address that have more than one duplicate. I can't get this query working in SQLAlchey with PostgreSQL.

SELECT count(*), email FROM contact group by email having count(*) > 1

I tried this:

all_records = db.session.query(Contact).options(
    load_only('email')).group_by(Contact.email).having(
    func.count('*') > 1).all()
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "contact.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT contact.id AS contact_id, contact.email AS contact_em...
           ^
[SQL: 'SELECT contact.id AS contact_id, contact.email AS contact_email \nFROM contact GROUP BY contact.email \nHAVING count(%(count_1)s) > %(count_2)s'] [parameters: {'count_1': '*', 'count_2': 1}]

And I tried this:

all_records = db.session.query(func.count(Contact.id)).options(
    load_only('email')).group_by(Contact.email).having(
    func.count('*') > 1).all()
sqlalchemy.exc.ArgumentError
sqlalchemy.exc.ArgumentError: Wildcard loader can only be used with exactly one entity.  Use Load(ent) to specify specific entities.

It works correctly if I execute raw SQL:

all_records = db.session.execute(
    "SELECT count(*), email FROM contact group by email"
    " having count(*) > 1").fetchall()

I'm using Flask-SQLAlchemy, but here's a minimal SQLAlchemy setup to demonstrate the issue:

import sqlalchemy as sa
from sqlalchemy import orm
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Contact(Base):
    __tablename__ = 'contact'
    id = sa.Column(sa.Integer, primary_key=True)
    email = sa.Column(sa.String)

engine = sa.create_engine('postgresql:///example', echo=True)
Base.metadata.create_all(engine)
session = orm.Session(engine)
session.add_all((
    Contact(email='a@example.com'),
    Contact(email='b@example.com'),
    Contact(email='a@example.com'),
    Contact(email='c@example.com'),
    Contact(email='a@example.com'),
))
session.commit()

# first failed query
all_records = session.query(Contact).options(
    orm.load_only('email')).group_by(Contact.email).having(
    sa.func.count('*') > 1).all()

# second failed query
all_records = db.session.query(sa.func.count(Contact.id)).options(
    orm.load_only('email')).group_by(Contact.email).having(
    sa.func.count('*') > 1).all()

With the sample data, I expect to get one result row, 3, a@example.com.

davidism
  • 121,510
  • 29
  • 395
  • 339
bool.dev
  • 17,508
  • 5
  • 69
  • 93
  • 2
    As to why id is added even though you specify `load_only()`, see [this](https://stackoverflow.com/a/44843223/2681632). – Ilja Everilä Oct 27 '17 at 16:19

2 Answers2

8

You're not building the same query in SQLAlchemy that you're writing manually.

You want to select the count of each email that has more than one occurrence.

q = session.query(
    db.func.count(Contact.email),
    Contact.email
).group_by(
    Contact.email
).having(
    db.func.count(Contact.email) > 1
)
print(q)
SELECT count(contact.email) AS count_1, contact.email AS contact_email 
FROM contact GROUP BY contact.email 
HAVING count(contact.email) > %(count_2)s

The first query fails because you query the entire model, so SQLAlchemy selects all columns. You can only select grouped columns when using group_by. SQLAlchemy must always select the primary key when querying the entire model, load_only doesn't affect that.

The second query fails because load_only only works when selecting an entire model, but you're selecting an aggregate and a column.

davidism
  • 121,510
  • 29
  • 395
  • 339
4

Just select what you would in a text query:

db.session.query(func.count('*'), Contact.email).\
    group_by(Contact.email).\
    having(func.count('*') > 1).\
    all()
bool.dev
  • 17,508
  • 5
  • 69
  • 93
Ilja Everilä
  • 50,538
  • 7
  • 126
  • 127