How can I remove duplicates from a query result if I know the table and column that is causing the duplicates?

Question

I have a huge query with a lot of JOINs. It is producing duplicates.

I am using this technique below that I found here on SO to identify which table the duplicates come from:

SELECT
   TableA = '----------', TableA.*,
   TableB = '----------', TableB.*
FROM ...

Here is an example of the data:

TABLE_A     USER_ID             TABLE_B                 LOCATION                    USER_CODE   LOCATION_CODE   TABLE_C                     SCI_YEAR_CODE
USER        1092993811          COL_PATHS_SCIENCE_ED    University Of N. Maryland   NULL        ND              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    University Of N. Maryland   NULL        ND              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT
USER        1092993811          COL_PATHS_SCIENCE_ED    University Of N. Maryland   NULL        ND              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    University Of N. Maryland   NULL        ND              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT
USER        1092993811          COL_PATHS_SCIENCE_ED    California of College       NULL        MH              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    California of College       NULL        MH              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT
USER        1092993811          COL_PATHS_SCIENCE_ED    California of College       NULL        MH              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    California of College       NULL        MH              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2016_AAB
USER        1092993811          COL_PATHS_SCIENCE_ED    New York City Tech          NULL        BS              BIO_PATHS_SCIENCE_RESEARCH  2017_RRT

You can see the table columns causing the most duplicates come from TABLE_C, BIO_PATHS_SCIENCE_RESEARCH.

For the SCI_YEAR_CODE, I just need to get the most recent date and only need the SCI_YEAR_CODE that ends with RRT

Is there a way to "weed" these duplicates out?

Thanks!

start by using DISTINCT, and when you get such results this means you have to rethink your model — nbk, Jul 06 '23 at 15:24
no first try a simple DISTINCT with out any thing at the start of the select — nbk, Jul 06 '23 at 15:26
Don't try to solve this with `DISTINCT`. You say: "For the SCI_YEAR_CODE, I just need to get the most recent date and only need the SCI_YEAR_CODE that ends with RRT". Have you put anything in your query to meet these conditions? You may be looking for `CROSS APPLY` in combination with `TOP` to only join the most recent entry from a table. Look this up and come back when you get stuck. In that case show your query and sample data, so we see what you are trying to achieve. — Thorsten Kettner, Jul 06 '23 at 15:31
In order to help you, you'd need to update your question with a minimal reproducible example i.e. a SQL statement that illustrates the issues; definitions of the tables involved; sample data to illustrate the issue — NickW, Jul 06 '23 at 15:32

score 1 · Accepted Answer · answered Jul 06 '23 at 15:55

You can use the ROW_NUMBER() to assign a sequential number for each row within each USER_ID, LOCATION_CODE, and TABLE_C partition and then filter the results to include only rows where RowNum = 1 :

   SELECT *
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY USER_ID, LOCATION_CODE, TABLE_C ORDER BY SCI_YEAR_CODE DESC) AS RowNum,
            TABLE_A.*,
            TABLE_B.*,
            TABLE_C.*
        FROM
            TABLE_A
        JOIN
            TABLE_B ON TABLE_A.USER_ID = TABLE_B.USER_ID
        JOIN
            TABLE_C ON TABLE_B.LOCATION_CODE = TABLE_C.LOCATION_CODE
    ) AS sub
    WHERE
        sub.RowNum = 1
        AND SCI_YEAR_CODE LIKE '%RRT';

How can I remove duplicates from a query result if I know the table and column that is causing the duplicates?

1 Answers1