How to eliminate duplicates in certain columns based on logic in non duplicate columns

Question

I am trying to modify this report that I have. The report is supposed to identify changes to attributes of a set of data.

For example, a person and his address with respect to his lease renewals. So imagine that the person was at address-1 from 1-1-2018 through 3/12/2018, then moved to another address-2, initially signing lease for 3/12/2018 through 12/31/2021, but then changed the lease later to 3/12/2018 through 10/1/2018. He then went ahead and extended his lease from 10/2/2018 to 12/31/2020. In this case, i want to not show the initial lease signing period of 3/12/2018 through 12/31/2021. The current table stores all of these records and i don't want this particular row to be shown just so the end user doesn't get confused.

The way I look at it, if i am able to get all the "duplicates" in columns Person, Person_ID, Phone_Number, end_date and then for those duplicate rows only keep the highest start_date, i would be able to be report out the right way. I am, however, stuck in my ability to eliminate the min(start_date) from those rows while showing all of the data.

select 
    subset.person, 
    subset.person_id, 
    subset.phone_number, 
    subset.end_date, 
    count(*)
from subset
group by 
    subset.subset.person, 
    subset.person_id, 
    subset.phone_number, 
    subset.end_date
having count(*)>1
order by person_id;

Now for these rows, i need to select the max(start_date) and keep only those in the report (I think i am stuck there).

So the table essentially shows the following:

Person  Person ID   Phone Number    Address 1   Zip Code    Start Date  End Date
Jane    1   8792029484  some address-1  12345   3/5/2016    11/9/2018
Jane    1   8792029484  some address-2  34455   11/10/2018  12/31/9999
Jane    1   8792029484  some address-2  34455   11/10/2018  12/7/2018
Jane    1   8792029484  some address-2  34455   12/8/2018   12/31/9999
John    2   9808845768  another address-1   68687   1/1/2013    11/9/2018
John    2   9808845768  another address-2   89384   11/10/2018  12/31/9999
John    2   9808845768  another address-2   89384   11/10/2018  12/7/2018
John    2   9808845768  another address-2   89384   12/8/2018   12/31/9999

For Jane and John, I don't want the following rows to be displayed:

    Jane    1   8792029484  some address-2  34455   11/10/2018  12/31/9999
    John    2   9808845768  another address-2   89384   11/10/2018  12/31/9999

Expected result:

   Person   Person ID   Phone Number    Address 1   Zip Code    Start Date  End Date
   Jane 1   8792029484  some address-1  12345   3/5/2016    11/9/2018
   Jane 1   8792029484  some address-2  34455   11/10/2018  12/7/2018
   Jane 1   8792029484  some address-2  34455   12/8/2018   12/31/9999
   John 2   9808845768  another address-1   68687   1/1/2013    11/9/2018
   John 2   9808845768  another address-2   89384   11/10/2018  12/7/2018
   John 2   9808845768  another address-2   89384   12/8/2018   12/31/9999

It sounds like you want to get the most recent lease, based not on the dates of the lease itself, but on the date when the lease was signed. (In your first example, the person has two leases that start on 3/12/2018, but you want the record that the person signed most recently.) Is that right? If so, do you have a field with something like "activity date"? — A. S. K., Feb 13 '19 at 21:17
I actually do not have a field like that. Even if I had it, it would need me to get the max(activity date) to eliminate the row. I think the same can be achieved by only keeping the max(start date); just don't know how to get there. — ForeverLearning, Feb 13 '19 at 21:30
The answers to [this question](https://stackoverflow.com/questions/121387/fetch-the-row-which-has-the-max-value-for-a-column) show several different ways to get what you need. You may still end up with duplicates if you have people with multiple leases that start on the same day, but it's a start! — A. S. K., Feb 13 '19 at 21:34

score 0 · Accepted Answer · answered Feb 13 '19 at 21:35

Here is a query that might do what you expect. The subquery makes use of window function ROW_NUMBER() to assign a number to each record, within groups of records having the same PersonID/adress1/StartDate. Row numbers are ordered by EndDate. Then, the outer query keeps just the record having the minimum EndDate in each group.

SELECT person, personid, phonenumber, address1, zipcode, startdate, enddate
FROM (
    SELECT 
        s.*, 
        ROW_NUMBER() 
            OVER(PARTITION BY personID, address1, startdate ORDER BY enddate) rn
    FROM subset s
) x WHERE rn = 1

This demo on DB Fiddle with your sample data returns :

PERSON | PERSONID | PHONENUMBER | ADDRESS1          | ZIPCODE | STARTDATE | ENDDATE  
:----- | -------: | ----------: | :---------------- | ------: | :-------- | :--------
Jane   |        1 |  8792029484 | some-address-1    |   12345 | 05-MAR-16 | 09-NOV-18
Jane   |        1 |  8792029484 | some-address-2    |   34455 | 10-NOV-18 | 07-DEC-18
Jane   |        1 |  8792029484 | some-address-2    |   34455 | 08-DEC-18 | 31-DEC-99
John   |        2 |  9808845768 | another-address-1 |   68687 | 01-JAN-13 | 09-NOV-18
John   |        2 |  9808845768 | another-address-2 |   89384 | 10-NOV-18 | 07-DEC-18
John   |        2 |  9808845768 | another-address-2 |   89384 | 08-DEC-18 | 31-DEC-99

Thank you. This worked! – ForeverLearning Feb 14 '19 at 19:51 — ForeverLearning, Feb 14 '19 at 19:51

score 0 · Answer 2 · answered Feb 13 '19 at 22:50

0

I am guessing that your logic is that you want only one row with an enddate of 12/31/9999 and that should be from the most recent startdate.

If so:

select t.*
from (select t.*,
             row_number() over (partition by person_id, enddate order by startdate desc) as seqnum
      from subset t
     ) t
where enddate <> date '9999-12-31' or seqnum = 1;

answered Feb 13 '19 at 22:50

Gordon Linoff

1,242,037
58
646
786

I think this would also work, although I tried the prior answer. Thanks! – ForeverLearning Feb 14 '19 at 19:52

How to eliminate duplicates in certain columns based on logic in non duplicate columns

2 Answers2