-1

I am a data-analyst, just diving into the SQL-depths. Forgive me if my problem stated here is not formulated clear, I am open for suggestions to improve my problem formulating skills!

So I have this table, but there is some double information in it (see --DATASET & --VALUES, PersonID = 2, RowNr = 1 & 2). I want to filter out rows that are double, so that the statement reads: Select only those rows for which CreationID in RowNr 1 != CreationID in RowNr 2, with rows grouped by PersonID and StartDate.

Here's my script with the desired result:

-- DATASET
CREATE TABLE Conditional (
    PersonID int
    , StartDate date
    , CreationDate date
    , RowNr int
);

-- VALUES
INSERT INTO Conditional (PersonID, StartDate, CreationDate, RowNr)
VALUES
    ('1', '20151201', '20151220', '1'),
    ('1', '20151201', '20151220', '2'),
    ('2', '20160303', '20160304', '1'),
    ('2', '20160303', '20160401', '2');

-- DESIRED RESULT
INSERT INTO Conditional (PersonID, StartDate, CreationDate, RowNr)
VALUES
    ('1', '20151201', '20151220', '1'),
    ('2', '20160303', '20160304', '1'),
    ('2', '20160303', '20160401', '2');
jarlh
  • 42,561
  • 8
  • 45
  • 63
SHW
  • 461
  • 7
  • 26
  • 1
    I sort of see what you are trying to do, but it isn't clear why one of the `PersonID = 2` records is being filtered off and not the other. Maybe you can explain your logic better. – Tim Biegeleisen Mar 07 '18 at 10:02
  • 1
    is it for inserting into table or querying from table ? – Ven Mar 07 '18 at 10:02
  • 3
    Which [DBMS](https://en.wikipedia.org/wiki/DBMS) product are you using? Postgres? "SQL" is just a query language, not the name of a specific database product. –  Mar 07 '18 at 10:03
  • I guess only include records with same start and end dates – Ven Mar 07 '18 at 10:04
  • @Tim, you are right, I was wrong. I want to keep all information for which the CreationDate within (the same person with the same StartDate) differs. This means I want to keep both rows for Person 2, but only row 1 for Person 1. I adjusted the question. @B House: It is for querying data from a table. – SHW Mar 07 '18 at 10:07
  • The columns you are showing are the ones you want unique plus a row number. Does this match your real time scenario or are there other columns you want to select along? Do you need the row number in the results? And it is important you tell us what database system you want an answer for. – Thorsten Kettner Mar 07 '18 at 10:27

3 Answers3

0

You want to rank your records and only show the best ranked ones. To do so apply RANK on the rownr per personid, startdate, and creationdate. Then keep the rows ranked #1 with a limit clause, which is FETCH FIRST n ROWS WITH TIES in standard SQL.

select *
from conditional
order by rank() over (partition by personid, startdate, creationdate order by rownr)
fetch first 1 row with ties;

You haven't tagged your DBMS. Its limit clause may be different from standard SQL.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
0

What you are showing is insert-statements, so before it goes into the SQL-Database. I am answering based on the premise that you want to solve your problem in SQL.

You have different options:

  1. Create a unique key on PersonID + StartDate + CreationDate. You will then get an error upon trying to insert a duplicate.

  2. Keep all the data, filter it upon your Select-Events as needed. (see answer by Thorsten Kettner)

  3. Clean up after the insert and delete the duplicates: See this Question

You may also want to look on that RowNr field - may be superfluous and better to create this on the fly in your select.

Community
  • 1
  • 1
Daniel
  • 426
  • 3
  • 14
0

If on SQL Server or Oracle SQL:

WITH DuplicationRanking AS
(
    SELECT
        D.*,
        ROW_NUMBER() OVER (
            PARTITION BY
                PersonID, -- Your unique columns here
                StartDate,
                CreationDate
            ORDER BY
                RowNr ASC) AS DuplicationRanking
        FROM
            Conditional AS D
)
SELECT
    D.*
FROM
    DuplicationRanking AS D
WHERE
    D.DuplicationRanking = 1
-- You can GROUP BY here

On a general basis, using GROUP BY by the mentioned columns:

SELECT
    C.*
FROM
    Conditional AS C
    INNER JOIN
        (SELECT
            PersonID,
            StartDate,
            RowNr = MIN(RowNr)
        FROM
            Conditional
        GROUP BY
            PersonID,
            StartDate) AS D ON
        C.PersonID = D.PersonID AND
        C.StartDate = D.StartDate AND
        C.RowNr = D.RowNr
EzLo
  • 13,780
  • 10
  • 33
  • 38
  • That should be `ORDER BY rownr` I guess. And it's just Oracle SQL, not PL/SQL, which is a programming language in the DBMS used in triggers, functions, etc. – Thorsten Kettner Mar 07 '18 at 10:30