
I am running an SQL query which self-joins the same table 24 times in order to "look up" particular rows of the table according to 24 different criteria, so that I can use all those values in a calculation. While the performance is perfectly fine (the table is indexed and the join criteria are restrictive), I can't help but feel there is a code smell here.

Is there a better way of doing lookups in SQL?

(Apologies for not including an example; I hope I have phrased the question in a general fashion).

Edit: attempting an example anyway:

CREATE TABLE key (
    pk1 int,
    pk2 int,
    pk3 int,
    PRIMARY KEY (pk1, pk2, pk3)
);

CREATE TABLE values (
    pk1 int,
    pk2 int,
    pk3 int,
    pk4 int,
    pk5 int,
    value int,
    PRIMARY KEY (pk1, pk2, pk3, pk4, pk5)
 );

 SELECT k.pk1, k.pk2, k.pk3,
        v1.value + v2.value - v3.value * (v4.value / v5.value) + ... + v24.value as result
 FROM key k
     LEFT JOIN values v1
         on v1.pk1=k.pk1
         and v1.pk2=k.pk2
         and v1.pk3=k.pk3
         and v1.pk4=100
         and v1.pk5=200

     LEFT JOIN values v2
         on v2.pk1=k.pk1
         and v2.pk2=k.pk2
         and v2.pk3=k.pk3
         and v2.pk4=400
         and v2.pk5=800

     ...

     LEFT JOIN values v24
         on v24.pk1=k.pk1
         and v24.pk2=k.pk2
         and v24.pk3=k.pk3
         and v24.pk4=900
         and v24.pk5=700;

Edit 2: The reason for this structure is that the values table represents (mathematically speaking) a function of 5 variables, with pre-computed return values stored in the table for a variety of parameters.
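
Each left join in the example above is effectively one evaluation of that function; on its own, a single evaluation is just a keyed lookup, with the pk4/pk5 constants playing the role of the fixed parameters (the ? placeholders stand for the key columns):

    SELECT value
    FROM values
    WHERE pk1 = ? AND pk2 = ? AND pk3 = ?   -- the three variables supplied by the key table
      AND pk4 = 100 AND pk5 = 200;          -- the two fixed parameters for this term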

jl6
  • If these 24 rules are standard, why not create a view instead? – Icarus Jun 21 '12 at 19:15
  • ^Also a very good point! – Kris Gruttemeyer Jun 21 '12 at 19:15
  • I think you're onto something - 24 self-joins generally *smell* bad. But without knowing your particular data structure… it seems like another technology, outside of SQL, would be a better candidate to handle the calculation. – Jason McCreary Jun 21 '12 at 19:15
  • Are you returning information from 24 different rows in the original table? Or, do you have 24 different criteria used to find a single appropriate row? – Gordon Linoff Jun 21 '12 at 19:21
  • Instead of a view you could put the 2-table join into a CTE and reference the CTE 24 times. (a CTE is a kind of instant view; like a lambda) Example: http://stackoverflow.com/a/11128318/905902 (the two-table join only referenced two times) – wildplasser Jun 21 '12 at 19:31
  • Just one NULL value from any of the left-join legs will set your expression to NULL. Do you really want that? – wildplasser Jun 21 '12 at 19:37
  • wildplasser: Yes, I want to be able to identify where contributing data is missing. – jl6 Jun 21 '12 at 19:42
  • If you are only looking for the {pk1,pk2,pk3} tuples with fewer than 24 detail records, there are simpler ways. WRT the expression + 24 subjoins: it looks rather hard to simplify without transpose/pivot. – wildplasser Jun 21 '12 at 19:45
  • wildplasser: Sorry, I should have been clearer: I want a result for all {pk1,pk2,pk3} tuples, and I want that result to be null where a contributing value is not present in the values table. Agreed, this is a kind of pivot, but my SQL engine does not support PIVOT (or CTEs). – jl6 Jun 21 '12 at 19:49
  • So what is "your SQL engine"? Perhaps there are constructs or optimizations available for your specific platform that are possible, but nobody can offer them because you've intentionally left out that detail. – Aaron Bertrand Jun 21 '12 at 20:45
  • Why not tag it as such? Maybe you'll attract attention from folks in that space that haven't seen your question in the noise. – Aaron Bertrand Jun 21 '12 at 21:13

6 Answers


To start with, this isn't a self-join at all.

A self-join is when a table is joined to itself.

Examples of this are parent-child relationships in hierarchies and people who have relationships to other people (literally parent, child).

The case you give of using a table in different roles is not that uncommon.

If the different values in the table are not related in some essential way, I would have a problem with the design: it would be a case of the "one true lookup" table, where a variety of entity lookups are stored with a type code, so you get billing addresses, customers, shipping addresses, products and all sorts of things in the same lookup table.

In data warehouses, it is also possible to have dimensions used in different roles, particularly date or time dimensions.

A smell would be if the same lookup table was joined over and over for columns which are being used as an array - for instance first_child, second_child, third_child - since this is typically a violation of normalization.

My only concerns with what you have shown here are:

The magic numbers, which appear to be used to pick a 3-dimensional slice out of the 5-dimensional space of all values. I assume these are themselves defined in a table somewhere (pk4, pk5, description).

At that point I would consider turning each into a view to make it more readable.
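
A minimal sketch of that idea (the view name is illustrative, and the constants are the ones from the question; a table called values may need quoting where the word is reserved):

    CREATE VIEW values_100_200 AS
    SELECT pk1, pk2, pk3, value
    FROM values            -- the values table from the question
    WHERE pk4 = 100
      AND pk5 = 200;

    -- the outer query then only needs the three shared key columns:
    -- LEFT JOIN values_100_200 v1
    --     ON v1.pk1 = k.pk1 AND v1.pk2 = k.pk2 AND v1.pk3 = k.pk3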

In SQL Server (or DB2, which has the same construct), I would actually consider using an inline table-valued function parameterized on pk4 and pk5, which would help a little to prevent someone from accidentally joining with incomplete join criteria - and you end up with one ITVF instead of many views.
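
A rough sketch of the ITVF idea in SQL Server syntax (the function name is illustrative; values is bracketed because it is a reserved word there):

    CREATE FUNCTION dbo.LookupValues (@pk4 int, @pk5 int)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT pk1, pk2, pk3, value
        FROM [values]
        WHERE pk4 = @pk4
          AND pk5 = @pk5
    );

    -- one function replaces the 24 views:
    -- LEFT JOIN dbo.LookupValues(100, 200) v1
    --     ON v1.pk1 = k.pk1 AND v1.pk2 = k.pk2 AND v1.pk3 = k.pk3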

But all this is simply cleanup - the design of the query and tables seems pretty sound to me.

Cade Roux
  • In my real code, the magic numbers are indeed the results of subqueries (not relevant to the example). If my SQL engine supported user-defined functions of any sort I would parameterize on pk1, pk2, pk3, pk4 and pk5 and so avoid any joins in the outer query (this values table *is* a function really). – jl6 Jun 21 '12 at 20:43
  • @jl6 scalar functions are a lot slower than tables, but a table-valued function is a parametrized view - I have done calculation lookup tables exactly like this for a multi-dimensional space to precompute a few million distinct parameter values instead of hundreds of millions of calls to a scalar function. – Cade Roux Jun 21 '12 at 20:55
  • I'm talking about retaining the pre-computed values table, but defining a function of 5 variables that returns a scalar value by SELECTing it from that table using a where clause based on the 5 parameters. So, I'd call that function 24 times and combine the results according to the required formula. Isn't that essentially a more direct version of what I'm doing with all the left joins? – jl6 Jun 21 '12 at 21:17
  • @jl6 Sure, but performance will suck. Typically the optimizer can do things with joins, but it can do little with scalar functions, and a scalar function that just selects values out of a table introduces a whole world of IO overhead the optimizer can't work with because it's hidden in a black box. – Cade Roux Jun 21 '12 at 21:26
  • OK, I understand now. The solution using scalar functions would have to read the table 24 times whereas a cleverly optimized join might potentially read the table only once. – jl6 Jun 21 '12 at 21:31
  • @jl6 24 reads per key row almost guaranteed, whereas the set-based version can have all kinds of tricks in the execution plan. – Cade Roux Jun 21 '12 at 21:39

I had started to answer this question with a whole lookup table example, but realized that there's a much bigger underlying problem.

Check out these references.

http://en.wikipedia.org/wiki/Entity-attribute-value_model

Key value pairs in relational database

The type of schema that you're working with runs counter to the idea of relational databases. Try flattening out your tables so you don't have a key/value relationship.

If your equation is a non-regular aggregate (yeah, don't look up that one), such as v1.val + v2.val / v3.val, then you want to aim to have every variable in that equation in a single row in one or more tables after one or more (but fewer than 24) joins.

Entity-attribute-value schemas suffer from poor performance, difficult maintenance, and really, really bad smells.

...

That didn't answer the question, so here goes. Use a view like Icarus suggests in the comments, or burn the whole thing down and rebuild something a bit more normalized.
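
If the existing values table has to stay, one way to get each equation variable into a single row without PIVOT or CTE support is conditional aggregation - a sketch, with the v1/v2/v24 column names illustrative and the (pk4, pk5) constants taken from the question:

    SELECT pk1, pk2, pk3,
           MAX(CASE WHEN pk4 = 100 AND pk5 = 200 THEN value END) AS v1,
           MAX(CASE WHEN pk4 = 400 AND pk5 = 800 THEN value END) AS v2,
           MAX(CASE WHEN pk4 = 900 AND pk5 = 700 THEN value END) AS v24
    FROM values                 -- quote the name where the word is reserved
    GROUP BY pk1, pk2, pk3;

    -- each output row now carries all contributing values for one {pk1,pk2,pk3};
    -- a missing slice comes back as NULL, which propagates into the formula,
    -- and a single left join from the key table covers tuples with no rows at all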

Nick Vaccaro
  • It indeed looks like an EAV model, but the composite key looks strange (I would *at least* combine the {pk4,pk5} keys, maybe even {pk1,pk2,pk3}). Performance-wise there is nothing wrong with EAV. Older DBMSes had problems with large range tables, but nowadays things are better. – wildplasser Jun 21 '12 at 19:57
  • I politely disagree that EAVs perform well. Indexes perform on keys that are built across rows, not between them. Definitely happy to be proven wrong though. : ) – Nick Vaccaro Jun 21 '12 at 20:05
  • 2nd edit to give a little background on what the data represents. I'm not sure whether this is EAV or not (I don't recognise it from the wiki description) but performance is definitely not an issue for my use case (hundreds of thousands of rows in the values table). – jl6 Jun 21 '12 at 20:09
  • This reminds me of a correlation calculation. Not in the way it's calculated, but in the way SQL really screws things up. Throughout the calculation, you end up with a whole bunch of intermediate variables. Each could be stored in a distinct column in one or more tables, rather than all in the same table. That way, you can reference everything for a single set of inputs all at once. I.e., v1.value and v2.value are different variables in the equation, so why would they be stored in the same database column? – Nick Vaccaro Jun 21 '12 at 20:19
  • I recognize that I have 24 intermediate variables here. Are you saying it would be better to pivot them up into distinct columns in the same row of an intermediate table? Isn't that what I'm doing, minus the intermediate table? – jl6 Jun 21 '12 at 20:26
  • I'll probably have to bail on this answer at this point. I'd have to take a good look at the tables and data before giving a halfway decent answer. I like wildplasser's CTE idea. Go with that one. – Nick Vaccaro Jun 21 '12 at 20:39
  • I am working on it. @jl6: Could you give the full 24 variable expression? (or a surrogate) Dammit! points already given ;-[ – wildplasser Jun 21 '12 at 20:46
  • @Norla, saying EAV always performs poorly is like saying cursors, loops and cross joins always perform poorly. They don't always, and sometimes they are the right answer. There are very few absolutes, and this is not one of them - it's all about how you use the tool, not the tool itself. EAV can be abused just like any other construct. Please read https://sqlblog.org/blogs/aaron_bertrand/archive/2009/11/19/what-is-so-bad-about-eav-anyway.aspx – Aaron Bertrand Jun 21 '12 at 20:47
  • I just had the query with 100K*24 (only 6 terms used) run in 1 sec (pg 9.1). Not too bad for an EAV. – wildplasser Jun 21 '12 at 20:49
  • @AaronBertrand I can definitely see how SaaS benefits from EAV design, and maybe I am just having trouble seeing the forest through the trees. However, the result column seems to be the output of a static equation, rather than some dynamic aggregation. – Nick Vaccaro Jun 21 '12 at 20:57
  • wildplasser: Thank you for your comments. I would still be interested in your CTE version. I cannot use CTEs but if it gives the same result as my version and smells better, then it may well be the best answer. For the actual expression you could use any "non-regular aggregate" expression (thank you Norla, I like that description) involving plus, minus, times, divide, etc.. – jl6 Jun 21 '12 at 20:59
  • @jl6 Ty much. Best of luck in implementation. BTW - This question just popped up a few minutes ago. Great example of where _not_ to use an EAV. http://stackoverflow.com/questions/11146382/sql-issue-one-to-many-relationship-and-eav-model – Nick Vaccaro Jun 21 '12 at 21:04
  • Without a third table to join with there is hardly an advantage in using CTEs. Maybe I'll squeeze out the composite keys, too, and replace them with surrogates. They look too ugly to me. – wildplasser Jun 21 '12 at 21:04
  • @Norla: Based on what I've said about this values table representing a function, do you still consider it an EAV model? I don't see where the attributes come in, or the sparseness. Seems more 6NF to me. – jl6 Jun 21 '12 at 21:06
  • @Norla: I just saw it. I would do it if the OP had shown some effort. The EAV is disputable, but not wrong. At work, I have more pathological cases, which still perform very well. – wildplasser Jun 21 '12 at 21:08
  • @wildplasser I just love how it gets transformed right back into a single table on the display side. _Just store it as a table in the first place!_ : ) – Nick Vaccaro Jun 21 '12 at 21:09
  • @jl6: it certainly is an EAV model: you could add a set of parameters {fk4,fk5} *without changing the data model* . – wildplasser Jun 21 '12 at 21:10
  • @wildplasser Really good example. Was trying to come up with one. I'd never admit this, but everything after 3NF just makes me curl up in the fetal position. – Nick Vaccaro Jun 21 '12 at 21:13
  • @Norla: but the set of attributes is *flexible*: you can add a "field" to a "table" , without changing the table structure. Think about adding an attribute "test_result" (**required**) when the diagnosis is 'cancer'. – wildplasser Jun 21 '12 at 21:14

Seeing as you are using SAS, you could potentially clean it up a little using a data step merge. Something along the lines of:

data x;
  merge key 
        values(rename=(value=value1) where=(pk4=100 and pk5=200))
        values(rename=(value=value2) where=(pk4=400 and pk5=800))
        values(rename=(value=value3) where=(pk4=900 and pk5=700))
        etc...
        ;
  by pk1 pk2 pk3;
  result = ...;
  keep pk: value: result;
run;

I don't have SAS in front of me now so I'm not going to type out all the code and test it, but you get the idea. To me it would look a lot cleaner than the SQL. It's a pretty rare find when the data step merge offers a nicer alternative than the SQL approach.

Robert Penridge

If I understand what you're looking for, it's easier to use:

SELECT field1, field2 FROM Table1 t
WHERE EXISTS (SELECT 1 FROM LookupTable l WHERE l.fieldX = t.fieldX);
Kris Gruttemeyer

I have enumerated the parameter pairs {pk4,pk5} into a single key, num (I replaced the actual numbers by a 4*6 grid):

CREATE TABLE zparams
        ( num INTEGER
        , pk4 INTEGER
        , pk5 INTEGER
        , PRIMARY KEY (pk4,pk5)
        , CONSTRAINT ze_other UNIQUE (num)
        );
INSERT INTO zparams(num,pk4,pk5) VALUES
 (1,1,1), (2,1,2), (3,1,3), (4,1,4), (5,1,5), (6,1,6)
, (7,2,1), (8,2,2), (9,2,3), (10,2,4), (11,2,5), (12,2,6)
, (13,3,1), (14,3,2), (15,3,3), (16,3,4), (17,3,5), (18,3,6)
, (19,4,1), (20,4,2), (21,4,3), (22,4,4), (23,4,5), (24,4,6)
        ;

The gain from the CTE is rather futile:

EXPLAIN ANALYZE
WITH zzz AS (
        SELECT v.pk1 AS pk1
        , v.pk2 AS pk2
        , v.pk3 AS pk3
        , p.num AS num
        , v.value AS value
        FROM zparams p
        JOIN zvalues v ON v.pk4 = p.pk4 AND v.pk5=p.pk5
        )
 SELECT k.pk1, k.pk2, k.pk3,
        v1.value + v2.value - v3.value * (v4.value / v5.value) + v24.value as result
 FROM zkeys k
     LEFT JOIN zzz v1
         ON v1.pk1=k.pk1 AND v1.pk2=k.pk2 AND v1.pk3=k.pk3
         AND v1.num=1

     LEFT JOIN zzz v2
         ON v2.pk1=k.pk1 AND v2.pk2=k.pk2 AND v2.pk3=k.pk3
         AND v2.num=2

     LEFT JOIN zzz v3 ON v3.pk1=k.pk1 AND v3.pk2=k.pk2 AND v3.pk3=k.pk3
         AND v3.num=3

     LEFT JOIN zzz v4
         ON v4.pk1=k.pk1 AND v4.pk2=k.pk2 AND v4.pk3=k.pk3
         AND v4.num=4

     LEFT JOIN zzz v5
         ON v5.pk1=k.pk1 AND v5.pk2=k.pk2 AND v5.pk3=k.pk3
         AND v5.num=5

     LEFT JOIN zzz v24
         ON v24.pk1=k.pk1 AND v24.pk2=k.pk2 AND v24.pk3=k.pk3
         AND v24.num=24
        ;

And, tested on 100K * 24 variables (6 used), the CTE actually performs worse (4.5 sec) than the plain joins (1 sec) on {pk1,pk2,pk3,pk4=constant,pk5=constant}. But at least it looks cleaner.

wildplasser
  • Well, at least I did not call it a tree. – wildplasser Jun 21 '12 at 22:06
  • No. He is just saying that this is *not* a hierarchical thing (not a tree), but a 3+2 dimension key-parameter space (he is a BI guy obviously). He also makes an attempt to reduce the 5D space into a 2D space, which is logical (I said the same thing in a comment). And, his conclusion: the data model is not that bad. IMHO the function-thing is cumbersome performance-wise, and very much dependent on the DBMS being used. – wildplasser Jun 21 '12 at 22:18

As an alternative to my above answer, you could also do something like this, I believe:

CREATE TABLE XX AS 
SELECT k.pk1, k.pk2, k.pk3, v1.pk4, v1.pk5, v1.value
     FROM key k
 LEFT JOIN values v1
     on v1.pk1=k.pk1
     and v1.pk2=k.pk2
     and v1.pk3=k.pk3
     and ( 
          (v1.pk4=100 and v1.pk5=200) or
          (v1.pk4=400 and v1.pk5=800) or
          (v1.pk4=700 and v1.pk5=900) 
          );


proc transpose data=xx out=trans;
  by pk1 pk2 pk3;
  var value;
run;

data result;
  set trans;
  result = ...;
run;

Again, I don't have SAS in front of me so I can't test it, but I'm sure you get the picture. This way you only have the one join happening and you break the rest of it out into additional steps. I'm not too sure how this will perform as it probably won't use the index any more. Just throwing some ideas out there...

Robert Penridge
  • I think this is basically the way to go (the SAS tag was added later). IMHO there still is a problem when one or more of the {pk4,pk5} tuples are missing. A solution would be to perform a left/right join with the complete set (N=24) of {pk4,pk5} tuples. – wildplasser Jun 22 '12 at 08:44