How to write SQL query to find duplicates in tables

Question

I'm currently trying to write a SQL query that finds any conflicts where any rows have the same x and y values.

Here are the tables I'm currently working with:

CREATE TABLE Slot (
    sid  INT,
    wall varchar(200),
    x    FLOAT,
    y    FLOAT,

    PRIMARY KEY (sid)
)

CREATE TABLE Route (
    rid     INT,
    name    varchar(200),
    circuit varchar(200),

    PRIMARY KEY (rid)
)

CREATE TABLE Placement (
    rid INT FOREIGN KEY REFERENCES Route(rid),
    hid INT FOREIGN KEY REFERENCES Hold(hid),
    sid INT FOREIGN KEY REFERENCES Slot(sid)
)

So I'm trying to find any Slots that are on the same wall and have identical x and y values. In addition to this, I want them to all be the same Route circuit.

I don't know if I should be trying to use the third table of "Placement", as I'm pretty new to this and got confused when trying to join them because they don't have any shared columns.

Here is what I currently have

SELECT
    DISTINCT
    S.sid

FROM
    Slot as S,
    Route as R

WHERE
    R.circuit = 'Beginner'

GROUP BY
    S.x,
    S.y,
    S.wall

HAVING
    COUNT(*) > 1

But this throws an error because I have to be using 'S.sid' in a GROUP BY or an aggregate function, but I don't want to group by that.

Here are the INSERT functions I was using to try and use as examples for what I have so far.

INSERT INTO Slot (sid, wall, x, y) VALUES (2345, 'south', 4, 7)
INSERT INTO Slot (sid, wall, x, y) VALUES (4534, 'south', 4, 7)
INSERT INTO Slot (sid, wall, x, y) VALUES (2456, 'west', 1, 7)

So here it would return the sid's 2345 and 4534 because they're both on the South wall and have the same x and y values.

You need to add `NOT NULL` to your `CREATE TABLE` statements, otherwise your columns will be nullable-by-default (which should not be a default either...) — Dai, Dec 15 '21 at 11:03
Don't we also need example data for your `Route` and `Placement` tables? — Dai, Dec 15 '21 at 11:05
Proptip: Don't use the _ancient_ "comma in `FROM`"-syntax for JOINs. Always use explicit joins: https://a-gentle-introduction-to-sql.readthedocs.io/en/latest/part3/explicit-joins.html — Dai, Dec 15 '21 at 11:10
If `x` and `y` represent coordinates then you should use `decimal`, not `float`. The `float` type is an _approximate type_ and so you cannot perform (meaningful) equality checks. So avoid using `float` (and `real`) wherever possible. — Dai, Dec 15 '21 at 11:20
"In addition to this, I want them to all be the same Route circuit." <-- This statement is ambiguous given that the `Placement` table already allows the same `Slot` (as there's no PK or `UNIQUE` constraint on that table at all). (What is the purpose of the `Hold` table? how does it affect your definition of "duplicate" data?), also what should happen if multiple `Slot` rows have identical `wall,x,y` values but don't have any corresponding `Placement` rows? Or where multiple `Placement` rows correspond to the same _and non-duplicate_ rows in `Slot`? — Dai, Dec 15 '21 at 11:30

score 0 · Answer 1 · answered Dec 15 '21 at 11:40

Some things you need to be made aware-of:

The ancient-style of JOIN, where you do SELECT ... FROM x, y WHERE x.a = y.b, should not be used. I wish modern RDBMS would block queries using it (outside of any compatibility mode).
- Always use explicit JOIN clauses! for the sake of readability and maintainability (while performance shouldn't be different, using explicit JOINs make it far, far easier to investigate performance issues should they occur).
Don't use the float or real types for representing exact quantities (by "exact" I don't mean integral: you can have exact fractional quantities), instead the decimal type should be preferred.
You should always include the Schema Name (default is dbo.) when referencing tables in queries as it solves issues relating to ambiguity.
- Including the schema-name is now required when referencing any UDFs, UDTs, and other more modern SQL Server features.
- This is because performing equality-checks on float and real values in T-SQL is a pain. This includes evaluating JOIN criteria.
Your many-to-many linking table dbo.Placement also allows duplicates because it doesn't have a PK defined.
Don't use short, cryptic column names like rid, hid and sid. Software should be self-documenting. I would name those columns to RouteId, HoldId, and SlotId respectively.
- Don't fall for the mistake of naming a column just Id. Column names should not need the name of their parent table to be understandable (this is because queries can/will/do expose your data, often with their original column names, in contexts without their original table names, such as in CTEs, derived-table queries, VIEWs, etc).
It's subjective, but I believe the table-names should be plural, not singular (after-all, a table holds multiple rows - I'd only give a table a singular name if that table will only ever hold a single row).
- The worst argument I've heard so far advocating for singular instead of plural is because (apparently) some ORMs and code-gen tools lack the ability to convert a plural noun to a singular noun. Yeesh. That hasn't been true for 20+ years now.

First, to avoid problems caused by using float types in JOIN conditions I'll change your dbo.Slot table to use decimal:

CREATE TABLE dbo.Slot2 (
    sid  int           NOT NULL,
    wall varchar(200)  NOT NULL,
    x    decimal(19,6) NOT NULL, -- 6 decimal places should be enough.
    y    decimal(19,6) NOT NULL,

    CONSTRAINT PK_Slot PRIMARY KEY ( sid ),

--  CONSTRAINT UK_SlotValues UNIQUE ( wall, x, y ) -- This will prevent duplicate values in future.
);

INSERT INTO dbo.Slot2 ( sid, wall, x, y )
SELECT
    sid,
    wall,
    CONVERT( decimal(19,6), x ) AS x2,
    CONVERT( decimal(19,6), y ) AS y2
FROM
    dbo.Slot;
    
DROP TABLE dbo.Slot;

EXEC sp_rename 'dbo.Slot2', 'Slot';

With that taken care-of, let's now get the duplicate values in the set of slots (i.e. find the identical wall, x, y values without other values):

SELECT
    wall,
    x,
    y
FROM
    dbo.Slot
GROUP BY
    wall,
    x,
    y
HAVING
    COUNT(*) >= 2

Then we do an INNER JOIN between the original dbo.Slot table and this set of duplicate values, as well as adding a ROW_NUMBER value to make it easier to choose a single row to keep if the other duplicates are removed:

WITH duplicateValues (
    SELECT
        wall,
        x,
        y
    FROM
        dbo.Slot
    GROUP BY
        wall,
        x,
        y
    HAVING
        COUNT(*) >= 2
)
SELECT
    ROW_NUMBER() OVER ( PARTITION BY s.wall, s.x, s.y ORDER BY s.sid ) AS n,
    s.*
FROM
    dbo.Slot AS s
    INNER JOIN duplicateValues AS d ON
        s.wall = d.wall
        AND
        s.x = d.x
        AND
        s.y = d.y

In your post you mentioned wanting to also consider the Placement table, however we need further details because your post doesn't explain how the Placement table should work.

However your Placement table should still have a PK. I'm assuming that the Placement table's HoldId column is not a key column, so should look like this:

CREATE TABLE dbo.Placement (
    RouteId int NOT NULL,
    SlotId  int NOT NULL,
    HoldId  int NOT NULL,

    CONSTRAINT PK_Placement PRIMARY KEY ( RouteId, SlotId ),
    
    CONSTRAINT FK_Placement_Route FOREIGN KEY ( RouteId ) REFERENCES dbo.Route ( rid ),
    CONSTRAINT FK_Placement_Slot  FOREIGN KEY ( SlotId  ) REFERENCES dbo.Slot ( sid ),
    CONSTRAINT FK_Placement_Hold  FOREIGN KEY ( HoldId  ) REFERENCES dbo.Hold ( hid )
);

How to write SQL query to find duplicates in tables

1 Answers1