How to remove rest of the ordered rows with the same ID after the first row which occurs more than once with that ID?

Question

I have the following structure for the table DataTable: every column is of the datatype int, RowID is an identity column and the primary key. LinkID is a foreign key and links to rows of an other table.

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
5       120      5      1       10
6       120      6      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
13      371      7      7       2
14      371      8      17      4
.................................
.................................

I'm trying to do a query which alters every LinkID batch in the following way:

Take every row with same LinkID (e.g. the first batch is the first 6 rows here)
Order them by the Order column
Look at Data and DataSpecifier columns as one compare unit (They can be thought as one column, called dataunit):
Keep as many rows from Order=1 onwards, until a dataunit comes by which appears more than one time in the batch
Keep that final row, but delete rest of the rows with same LinkID and greater Order value

So for the LinkID 120:

Sort the batch by the Order column (already sorted here, but should still do it)
Start looking from the top (So Order=1 here), go as long as you don't see a value which appears more than 1 time in the batch
Stop at the first duplicate Order=3 (dataunit 1 10 is also on Order 5).
Delete everything which has the LinkID=120 AND Order>=4

After similar process for LinkID 371 (and every other LinkID in the table), the processed table will look like this:

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
.................................
.................................

I've never done an SQL query which this complicated. I know the query has to be something like this:

DELETE FROM DataTable  
WHERE RowID IN (SELECT RowID
                FROM DataTable
                WHERE -- ?
                GROUP BY LinkID
                HAVING COUNT(*) > 1 -- ?
                ORDER BY [Order]);

but I just can't seem to wrap my head around this and get the query right. I would preferably do this in pure SQL, with one executable (and reusable) query.

I asked a very similar question here: How to remove rest of the rows with the same ID starting from the first duplicate?

But since I realized that my original filtering logic in the question was not actually what I needed and that question had already been answered correctly, I had to make this new question.

I think you may be able to use the answer provided on here: https://stackoverflow.com/questions/42133711/how-to-include-only-rows-where-the-following-row-is-within-12-hours-and-rank-acc — BishNaboB, May 10 '19 at 08:45

score 2 · Accepted Answer · answered May 10 '19 at 08:47

Here, my previous solution updated. A few GROUP BY should be enough. Code is simple enough to be understood by reading.

Set up:

IF OBJECT_ID('tempdb..#YourData') IS NOT NULL
    DROP TABLE #YourData

CREATE TABLE #YourData (
    RowID INT,
    LinkID INT,
    [Order] INT,
    Data INT,
    DataSpecifier INT)

INSERT INTO #YourData (
    RowID,
    LinkID,
    [Order],
    Data,
    DataSpecifier)
VALUES
    ('1', ' 120', '1', '1', ' 1'), 
    ('2', ' 120', '2', '1', ' 3'), 
    ('3', ' 120', '3', '1', ' 10'), 
    ('4', ' 120', '4', '1', ' 13'), 
    ('5', ' 120', '5', '1', ' 10'), 
    ('6', ' 120', '6', '1', ' 13'), 

    ('7', ' 371', '1', '6', ' 2'), 
    ('8', ' 371', '2', '3', ' 5'), 
    ('9', ' 371', '3', '8', ' 1'), 
    ('10', '371', '4', '10', '1'), 
    ('11', '371', '5', '7', ' 2'), 
    ('12', '371', '6', '3', ' 3'), 
    ('13', '371', '7', '7', ' 2'), 
    ('14', '371', '8', '17', '4')

Solution:

;WITH DuplicatesByLinkID AS
(
    SELECT
        Y.LinkID,
        Y.Data,
        Y.DataSpecifier,
        [Order] = MIN([Order])
    FROM
        #YourData AS Y
    GROUP BY
        Y.LinkID,
        Y.Data,
        Y.DataSpecifier
    HAVING
        COUNT(*) > 1
),
FirstDuplicateByLinkID AS
(
    SELECT
        D.LinkID,
        MinOrder = MIN(D.[Order])
    FROM
        DuplicatesByLinkID AS D
    GROUP BY
        D.LinkID
)
DELETE Y FROM
    #YourData AS Y
    INNER JOIN FirstDuplicateByLinkID AS M ON
        Y.LinkID = M.LinkID AND
        Y.[Order] > M.MinOrder

SELECT * FROM #YourData

Result:

RowID   LinkID  Order   Data    DataSpecifier
1       120     1       1       1
2       120     2       1       3
3       120     3       1       10
7       371     1       6       2
8       371     2       3       5
9       371     3       8       1
10      371     4       10      1
11      371     5       7       2

Thank you so much! This works perfectly and was extremely easy to modify to work with my actual column and table names :) — ruohola, May 10 '19 at 10:13

How to remove rest of the ordered rows with the same ID after the first row which occurs more than once with that ID?

1 Answers1