Select data from a table where only the first two columns are distinct

Question

Background

I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.

I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)

However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)

The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).

What I've tried

I found the following post which seems to achieve what I want:

How do I (or can I) SELECT DISTINCT on multiple columns?

I tried the answers from both Joel,and Erwin.

Attempt 1:

However, with Joels answer the set returned is too large - the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 there are 400 columns returned, however when I use his solution 600 rows are returned. I checked the data and in fact there were duplicate pk's. Here is my attempt at duplicating Joels answer:

select a.emp_no, 
        a.eec_planning_unit_cde, 
        '11' as area, create_dte, 
        create_by_emp_no, modify_dte,
        modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
    select emp_no, eec_planning_unit_cde
    from tempdb.guest.temp_part_time_evaluator
    group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde

Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?

Attempt 2:

I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))

Here is the attempt:

select emp_no, 
    eec_planning_unit_cde, 
    '11' as area, create_dte, 
    create_by_emp_no, 
    modify_dte,
    modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
    select emp_no, eec_planning_unit_cde
    from tempdb.guest.temp_part_time_evaluator
    group by emp_no, eec_planning_unit_cde
)

Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.

Overview of Questions:

Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?

score 2 · Accepted Answer · answered Oct 24 '12 at 21:12

2

A select distinct will be based on all columns so it does not guarantee the first two to be distinct

select pk1, pk2, '11', max(c1), max(c2), max(c3) 
from table 
group by pk1, pk2

answered Oct 24 '12 at 21:12

paparazzo

44,497
23
105
176

Yeah, I'm aware of how distinct works on all columns, I just wasn't sure how to accomplish what I wanted to achieve. This works! And it is so simple, thanks so much! – Steve's a D Oct 24 '12 at 21:18

score 1 · Answer 2 · answered Oct 24 '12 at 21:09

You could TRY this:

SELECT a.emp_no, 
        a.eec_planning_unit_cde, 
        b.'11' as area, 
        b.create_dte, 
        b.create_by_emp_no, 
        b.modify_dte,
        b.modify_by_emp_no
FROM 
(
    SELECT emp_no, eec_planning_unit_cde
    FROM tempdb.guest.temp_part_time_evaluator
    GROUP BY emp_no, eec_planning_unit_cde
) a 
JOIN tempdb.guest.temp_part_time_evaluator b 
     ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde

That would give you a distinct on those fields but if there is differences in the data between columns you might have to try a more brute force approch.

SELECT a.emp_no, 
        a.eec_planning_unit_cde, 
        a.'11' as area, 
        a.create_dte, 
        a.create_by_emp_no, 
        a.modify_dte,
        a.modify_by_emp_no
FROM 
(
    SELECT ROW_NUMBER() OVER(ORDER BY emp_no, eec_planning_unit_cde) rownumber,
            a.emp_no, 
            a.eec_planning_unit_cde, 
            a.'11' as area, 
            a.create_dte, 
            a.create_by_emp_no, 
            a.modify_dte,
            a.modify_by_emp_no
    FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1

score 1 · Answer 3 · edited May 23 '17 at 12:12

I'll reply one by one:

Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do

Inner join don't do an intersection. Le'ts supose this tables:
```
T1   T2
n s  n s
1 A  2 X   
2 B  2 Y
2 C
3 D
```
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
```
select *
from t1 inner join t2
   on t1.n = t2.n;

| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
```
And, your second query approach:
```
select *
from t1 
where t1.n in (select n from t2);

| N | S |
---------
| 2 | B |
| 2 | C |
```
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?

Yes, this subquery:
```
select *
from t1 
where not exists (
   select 1
   from t2
   where t2.n = t1.n
);
```
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?

yes, using @JTC second query.

Select data from a table where only the first two columns are distinct

3 Answers3