Understanding Self-Join Distinct Query

Question

I'm unsure how this query works regarding the from EMPLOYEE A and EMPLOYEE B. I know distinct removes duplicated values, but why are there two tables if you are only selecting A.EMPLOYEE_ID, A.LAST_NAME, A.TITLE, doesn't B.EMPLOYEE_ID, B.LAST_NAME, B.TITLE need to be included in the SELECT part too? How does it work? Also I know that <> means not equal to.

SELECT DISTINCT A.EMPLOYEE_ID, A.LAST_NAME, A.TITLE
FROM EMPLOYEE A, EMPLOYEE B
WHERE A.EMPLOYEE_ID  <> B.EMPLOYEE_ID
AND  A.TITLE  <> B.TITLE
AND B.LAST_NAME = A.LAST_NAME

Your question seems to be around the intent of the query, not a technical one. You'd need to ask the author of the query or refer to any requirements it seems to be fulfilling. `DISTINCT` or indeed any `SELECT` does not have to use all of the tables or columns referred to if that is what you are asking. — LoztInSpace, Nov 03 '18 at 23:21
Possible duplicate of [What is a self join for? (in english)](https://stackoverflow.com/questions/36352271/what-is-a-self-join-for-in-english) — philipxy, Nov 04 '18 at 00:07
Hi. How is this not answered by googling?--you already know the term "self-join". How does an answer not just repeat a definiton of SQL SELECT statements? Edit to clearly justify your claim that there is no difference--if you don't find the flaw, show us & ask us. Returned rows are not a restriction & projection of the one table, they are a projection of the rows left after the join & restriction. They *are* a projection of rows that are in the table *that also have certain matches among the rows in the table*. See my answer at the link to describe returned rows vs whatever rows you expect. — philipxy, Nov 04 '18 at 00:27

score 0 · Accepted Answer · answered Nov 04 '18 at 00:48

This query returns all employees who share a last name with another employee, but not a title or employee id.

A more efficient way to write this query uses exists:

select e.*
from employee e
where exists (select 1
              from employee e2
              where e2.last_name = e.last_name and
                    e2.employee_id <> e.employee_id and
                    e2.title <> e.title
             );

Writing the query this way saves the effort for the removing duplicates from the join.

For performance, you would want an index on employee(last_name, employee_id, title).

score 0 · Answer 2 · answered Nov 04 '18 at 01:07

The returned rows are not a restriction & projection of the one table, they are a projection of the rows left after the join & restriction. They are columns from rows that are in the table but that also have certain matches among the rows in the table. That happens to be rows in a restriction & projection of all rows that can be made from a row in the table paired with a row in the table--which is what the comma's product (cross join) gives. (Explicit cross join or inner join could be used instead.)

The returned rows are for employees for which there exists another employee with the same last name but different id & title.

Still--Why was that particular query expression written? A query asks for rows that satisfy some criterion. Each base table is rows that satisfy some criterion. Each relational operator & SQL subexpression calculates rows that meet a criterion that is a certain transformation of the criteria of its argument tables. x NATURAL JOIN y is rows that satisfy the criterion for x AND the criterion for y; x UNION y is rows that satisfy the criterion for x OR the criterion for y; x WHERE c is rows that satisfy the criterion for x AND c; etc. So we write an SQL query whose criterion is the criterion for the rows we want. That just unfortunately isn't ever taught explicitly. (This correspondence between relational & logical operators and between relation values & criteria is the foundation of the relational model.)

Is there any rule of thumb to construct SQL query from a human-readable description?
What is a self join for? (in english)

Understanding Self-Join Distinct Query

2 Answers2