-1

I already know what a self-join does. Thank you, I also read all the other computerised operational descriptions on stack overflow, so I know this is not actually a duplicate question, so please do not give me tables or join lists.

What I am seeking to why it would be done (and please, not just the self-referencing employee-manager example).

In plain English, what would I seek to achieve from a self join?

My usage is in a university course, and coming from a Relational Algebra angle. I have done some SQL for a few years but the instructor loves to do self-joins on tables (after renaming one or more fields). Not something often done in SQL, so I'm wondering what the action is that he is trying to perform but he seems pretty keen about doing it frequently.

I thought I'd ask here, as many others have asked for this information but get marked as "already answered" but all the answers give operational descriptions NOT the "why is this being done".

philipxy
  • 14,867
  • 6
  • 39
  • 83
mist42nz
  • 97
  • 1
  • 8

2 Answers2

0

The reason why the employee-manager example is so common, is because it hits the nail on the head. A self join on a table looks for pairs of rows, like any join, but with both rows coming from the same table. Nothing special really.

MSalters
  • 173,980
  • 10
  • 155
  • 350
  • Thanks for the reply but I think you're missing what I'm asking.. Employee-manager does hit that _particular_ nail on the head precisely; and it's, Relational Algebra-icly speaking, a very particular type of self-referencing behaviour. That's why I specifically _excluded_ it from replies. On the surface it appears the self join [natural, since we're talking Relational Algebra, not SQL] is "nothing special" so _why_ do it? I'm thinking : Query1 = tableX ⋈ col1↦col1`, col2↦col2` (tableX) Query2 = σ col1 = col1 ` (σ col2 = col2'(Query1)) would"return full list of unique lines in tableX" – mist42nz Apr 02 '16 at 23:54
  • Oh, I think I it's not that particular at all. It occurs all the time, whenever two objects of the same type are related. Bank accounts, saving and checking? But you seem to miss that the join is almost always cross-column. It's not because a self join on a single column returns the table - it usually doesn't. It can return far, far more rows. Consider a self-join on a table of people where the genders are identical. It will return every possible pair of males _and_ every possible pair of females. Well-defined but rarely needed. – MSalters Apr 03 '16 at 00:08
  • Sorry, are you still join on about the employee-manager thing ("particular"). Or are you answering the question I asked? If you are your answer is very confusing. so far my answer appears, from my own research to be "as to create a double entry on all related items" ... ....thus to be set up for removal of true duplicates (select on <> for renamed fields) so to leave true instances where their are multiple cases of common rows. That can be removed for single, or projected to show reoccurances. – mist42nz Apr 04 '16 at 11:09
  • @mist42nz: I was talking about the general case in everything but the first sentence of my answer. "Removal of duplicates" is only one example where you do a self join. This is clear from my second sentence : a self join looks for any pair of rows. (I think you call that a "double entry"). Duplication removal' is a specific case in which you want such pairs: you intend to delete one of the two rows in each pair. – MSalters Apr 04 '16 at 13:30
  • Thanks that's much more along the lines of what I was asking. So self-join (relational algebra speaking) is helpful in : self-referrencing cases (employee-manager), in duplication removal (via removing rows which self-pair). Any other particularily useful operations that follow in that line. This means I can now speak to people in lay language and say "a way of doing duplication removal is to first do a self join". Thus it becomes a English discuss (or entrance to pseudo-code) which is exactly what I'm looking for. Thank you again! – mist42nz Apr 05 '16 at 22:59
  • @mist42nz What do you mean, "return full list of unique lines in tableX"? Since tableX is a relation, every "line" (row?) is "unique". Almost none of your English description of query results or cases in these comments makes sense. (And from your use of scare quotes you seem to know it.) To be understood in natural language means you have to put in the (sometimes considerable) effort to be *clear*. And if you don't know what you mean, then make that clear. By forcing yourself to write clearly you will improve your ability to understand & explain questions like this one for yourself. – philipxy May 23 '16 at 02:47
  • Sorry Phil but your inability to understand is not my problem. If you do a self, i.e. Natural join, there is a lot of duplicated rows. I was using lines because I was trying to steer clear of the SQL is everything idea and deal with the logic involved. It's not so much that the lines themselves are clones, but the information in the line is duplicated, which is a little tough to cover in this little sound bite box. With MSalters reply and the Prof at university I was able to get usable information – mist42nz May 25 '16 at 13:26
  • Where the prof has trouble getting down to the layperson.programmer level of discussion. One of the results of the self join is that the triples, (line) will carry its own value, and that of other similar lines. Eg if matched on an id, the. One row for each ID will have double values that are identical. If there is only one ID in the table, that one row will have duplicates. Where if the ID has two entries, then the renamed column/field will show up as an entry where that field is different from the original table. So by removing all the unique "same data" we're left with a complete list – mist42nz May 25 '16 at 13:32
  • Of just the information held in the duplicate ID's. We can the use that on a line by line basis, eg if there's a start and end date, we can see if there are overlaps in the period for that ID. Or we can look at the data, knowing that it represents a group of all data with "more than one occurrence". Thus we could remove this group from the original group, and thus leave only those with unique/single/one off data in the rename field(s), or instead we could do a second self join on our duplicates and remove those with only two entries in the original table. This – mist42nz May 25 '16 at 13:38
  • I knew what you were trying to say in your question and comments. I was just telling you that you weren't managing to say it. (The same goes for your comments since.) This is a problem for you if you want to be able to communicate with other people. (Thanks for the effort in those last 3 comments though.) – philipxy Jun 01 '16 at 01:38
  • This answer begs the question. – philipxy Nov 12 '17 at 20:45
0

The database designer gives each base table a predicate (sentence template parameterized by column names).

Parent(person, child) -- person PERSON is parent of person CHILD
Likes(person, food) -- person PERSON likes food FOOD

Relational algebra is designed so that the value of a relational expression (base table name or operator call) holds the rows that make a true proposition (statement) from its predicate.

/* (PERSON, CHILD) rows where
person PERSON is parent of person CHILD
*/
Parent

The predicate of an expression that is a call to operator NATURAL JOIN is the AND of the predicates of its inputs.

/* (PERSON, CHILD, FOOD) rows where
person PERSON is parent of person CHILD AND person PERSON likes food FOOD
*/
Parent NATURAL JOIN Likes

Ditto for UNION & OR, MINUS & AND NOT, PROJECT column(s) & EXISTS other column(s), RESTRICT condition & AND condition and RENAME of a column & rename of a parameter.

/* (CHILD, FOOD) rows where
there EXISTS a value for PERSON such that
    person PERSON is parent of person CHILD AND person CHILD likes food FOOD
*/
PROJECT child, food (Parent NATURAL JOIN (RENAME person:=child Likes))

So every query expression's value holds the rows that make its predicate into a true statement.

Suppose we define algebraic self-join of a table as NATURAL JOIN of two tables got from an original via sequences of zero or more renamings. Per above we NATURAL JOIN for rows that satisfy the AND of predicates. A self-join arises when we want the rows that satisfy a result predicate expressed via predicates that differ only in parameters/columns.

/* (PERSON, FOOD, CHILD) rows where
person PERSON likes food FOOD AND person CHILD likes food FOOD
*/
Likes NATURAL JOIN (RENAME person:=child Likes)

There's nothing special about a self-join arising in a given query in a given application other than that.


SQL SELECT DISTINCT statements can be described via algebraic operators. They also calculate query predicates. First FROM table columns are RENAMEd by prefixing a table alias (correlation name) & a dot. (SQL NATURAL JOIN doesn't dot common columns.) The new tables are NATURAL JOINed. ON and WHERE RESTRICT per a condition. Then the SELECT DISTINCT clause RENAMES to remove dots from returned columns & PROJECTS away unwanted dotted columns.

We can convert SQL to predicates directly: Dotting input columns renames. NATURAL/CROSS/INNER JOIN, ON & WHERE give AND. Each dot-free result column gives an AND that it equals its dotted version. Finally dropping all dotted columns gives EXISTS.

/* same as above */
/* (PERSON, FOOD, CHILD) rows where
there EXISTS values for P.* & C.* such that
        PERSON = P.PERSON AND CHILD = C.person AND FOOD = P.FOOD
    AND person P.CHILD likes food P.FOOD
    AND person C.CHILD likes food C.FOOD
    AND P.FOOD = C.FOOD
*/
SELECT DISTINCT p.person AS person, c.person AS child, p.food AS food
FROM Likes p INNER JOIN Likes c
ON p.food = c.food

Again: In SQL we say there is a self-join when multiple table aliases of a JOIN are associated with the same table value; in application terms that means we can express a query meaning in terms of predicates differing in some parameters/columns; there's nothing special about applications or table meanings for this to arise.


See this re query semantics which happens to include a link to this re self-join semantics in particular.

philipxy
  • 14,867
  • 6
  • 39
  • 83
  • 2
    Thank you for taking the time. But please read the question properly next time. There are plenty of SQL and even relational algebra replies on the Web, and even on Stackoverflow about the mechanics of self join. What you completely and utterly failed to do, was to give an English description of what it does (I.e. Why you'd want to do a self join), which was the entire point of the question. IIRC it was, by renaming one or more fields; and using self join to match and remove via redundancy other fields, the self join allows search of duplicates, overlaps, and multiplicity in a table. – mist42nz May 25 '16 at 13:16
  • This answer explains exactly "Why you'd want to do a self join" in terms of reasoning about the business/application: "when we want the rows that satisfy predicates that differ only in parameters [of sentence templates]". You asked to "to steer clear of the SQL is everything idea and deal with the logic involved". Logic *is by definition* the manipulation of predicates. The only thing this answer has to do with algebra & SQL is that it tells you how to convert Engish statements about the business/business to them. It says why & when the joins & renamings arise. – philipxy Nov 12 '17 at 20:41
  • Your comments do not reflect an understanding of what I wrote. I talk about business logic then, since the question asks how that leads to algebra/SQL inner join, I also talk about how that leads to inner join. I clearly do not "use" algebra or SQL for business logic, I use English. Formal logic is a subset of English, the language of precision in engineering (including software engineering) and science (including computer science). You are ignoring my use of "predicate" (which generalizes SQL use) as *a statement using names for things*, without which communication is impossible. – philipxy Nov 15 '17 at 00:47
  • philip, why do we put data into tables. – mist42nz Nov 16 '17 at 08:47
  • Because "Each named table has a predicate (sentence template parameterized by column names)." and "The value of an expression that is a table's name holds the rows that make a true proposition (statement) from its predicate." You have only to give any example to see this. – philipxy Nov 16 '17 at 09:30
  • @ADyson This is a clear, justified, illustrated, self-contained answer to the question that can be applied to any example. Whether it uses unfamiliar words or organization is irrelevant. (See also my previous comments.) If you can't follow or apply it then please ask for clarification where you get stuck. If you don't try to use it then you don't know whether it is useful. You might try to soundly & clearly justify that any query whatsoever that you write meets its specification in application terms, because that cannot be done more simply than using the notions in this answer. – philipxy Sep 05 '19 at 07:40