
I need to join two tables on their projects' names. But for a few project names that meet specific criteria, I need the join to match on their descriptions instead (the job description acts like a name and is unique). I am not 100% sure how to do this. Can a CASE expression be applied? I have provided what I have so far, but it isn't joining properly when I use a CASE expression on names that are like 'BTG -'.

SELECT
       [Name] AS 'NAME'
      ,[DATA_Id] AS 'ID_FIELD'
      ,FORMAT([ApprovedOn], 'MM/dd/yyyy') AS 'DATE_APPROVED'
      ,[DATA_PROJECT_NAME]
      ,[PHASE_NAME]
      ,[DATA_JOB_ID]
      ,[JOB_TYPE]
      ,[SUB_TYPE]
      ,FORMAT([CREATED_DATE], 'MM/dd/yyyy') AS 'DATE_CREATED'
      ,CASE
         WHEN [DATA_JOB_ID] = [DATA_Id] THEN 'OK'
         ELSE 'NEED DATA NUMBER'
       END AS ACTION_SPECIALISTS
      ,DATA_PROJECTS
  FROM [MI].[MI_B_View].[app_build]
  LEFT JOIN
  (SELECT *,
     CASE
       WHEN [DATA_PROJECT_NAME] LIKE 'BTG -%' THEN [JOB_DESCRIPTION]
       ELSE [DATA_PROJECT_NAME]
     END AS DATA_PROJECTS
   FROM [ExternalUser].[DATA].[JOB]
   WHERE [JOB_DESCRIPTION] LIKE '%ROW%' AND REGION = 'CITY') AS B
    ON [Name] = [DATA_PROJECTS]
  WHERE REGION_ID = 1
    AND APPROVED = 1
  ORDER BY [ApprovedOn] DESC
  • This is where some simple sample data demonstrating the "regular" join and the "alternative" join would help immensely. – SMor Nov 06 '19 at 19:52
  • 1) Don't use * in a select... be specific in your fields... 2) alias the tables, and use them in the join... that could be the problem with your join... Name could be in both, but because one used select *, it's hiding... or Data_Projects could be in both, but again, you used select * ... meaning there's a potential column naming ambiguity. Actually, now that I look again, I see that there is an ambiguity... try aliasing BOTH sides of your joins, and then be exacting on which cols you want to compare and select. – TechGnome Nov 06 '19 at 19:57

2 Answers


TL;DR: The answer by Caius Jard is correct: you can join on anything, as long as it evaluates to true or false (ignoring unknown).

Unfortunately, the way you join between two tables can have drastically different performance depending on your methodology. If you join on an expression, you will usually get very poor performance. Using computed columns, materializing the intermediate result in a table, or splitting up your join conditions can all help with poor performance.

Joins are not the only place where expressions can ding you; grouping, aggregates, filters, or anything that relies on a good cardinality estimate will suffer when expressions are involved.
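As a hypothetical illustration (the table and column names here are made up, not from the question), the same estimation problem appears when a filter wraps a column in an expression instead of comparing the column directly:

```sql
-- Hypothetical table/column names, for illustration only.
-- Wrapping the column in an expression hides it from column statistics,
-- so the optimizer has to guess the row count:
SELECT COUNT(*) FROM dbo.Orders WHERE YEAR(OrderDate) = 2019;

-- Comparing the bare column against a range lets the optimizer use its
-- statistics on OrderDate and produce a good cardinality estimate:
SELECT COUNT(*)
FROM dbo.Orders
WHERE OrderDate >= '20190101'
  AND OrderDate <  '20200101';
```

Both queries return the same count; only the second gives the optimizer something it can estimate well.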


When I compare two methods of joining (they are functionally equivalent despite the new magic column; more on that later)

SELECT *
  FROM #Build AppBuild
    LEFT OUTER JOIN #Job Job
      ON ( AppBuild.Name = Job.DATA_PROJECT_NAME
           AND Job.DATA_PROJECT_NAME NOT LIKE 'BTG -%' )
        OR ( Job.DATA_PROJECT_NAME LIKE 'BTG -%'
             AND Job.JOB_DESCRIPTION = AppBuild.Name );

SELECT *
  FROM #Build AppBuild
    LEFT OUTER JOIN #Job Job
      ON AppBuild.Name = Job.JoinOnMe;

The resulting query plans have huge differences:

(Screenshot: join on LIKE vs computed column)

You'll notice that the estimated cost of the first join is much higher - but that doesn't even tell the whole story. If I actually run these two queries with ~6M rows in each table, I end up with the second one finishing in ~7 seconds, and the first one being nowhere near done after 2 minutes. This is because the join predicate was pushed down onto the #Job table's table scan:

(Screenshot: predicate pushdown)

SQL Server has no idea what percentage of records will have a DATA_PROJECT_NAME (NOT) LIKE 'BTG -%', so it picks an estimate of 1 row. This then leads it to pick a nested loop join, and a sort, and a spool, all of which really end up making things perform quite poorly for us when we get far more than 1 row out of that table scan.

The fix? Computed columns. I created my tables like so:

CREATE TABLE #Build
(
  Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);

CREATE TABLE #Job
(
  JOB_DESCRIPTION   varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
  DATA_PROJECT_NAME varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
  JoinOnMe          AS CASE WHEN DATA_PROJECT_NAME LIKE N'BTG -%' THEN DATA_PROJECT_NAME
                            ELSE JOB_DESCRIPTION END
);

It turns out that SQL Server will maintain statistics on JoinOnMe, even though there is an expression inside it and the value has not been materialized anywhere. If you wanted to, you could even index the computed column.
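Indexing the computed column is one more statement; a sketch (this works without marking the column PERSISTED because the CASE expression is deterministic):

```sql
-- Sketch: index the computed column so the join can seek instead of scan.
-- Allowed here because the CASE expression defining JoinOnMe is deterministic.
CREATE INDEX IX_Job_JoinOnMe ON #Job ( JoinOnMe );
```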

Because we have statistics on JoinOnMe, a join on it will give a good cardinality estimate (when I tested it was exactly correct), and thus a good plan.

If you don't have the freedom to alter the table, then you should at least split the join into two joins. It may seem counterintuitive, but if you're ORing together a lot of conditions for an outer join, SQL Server will usually get a better estimate (and thus a better plan) if each OR condition is a separate join, and you then COALESCE the result sets.

When I include a query like this:

SELECT AppBuild.Name,
       COALESCE( Job.JOB_DESCRIPTION, Job2.JOB_DESCRIPTION ) JOB_DESCRIPTION,
       COALESCE( Job.DATA_PROJECT_NAME, Job2.DATA_PROJECT_NAME ) DATA_PROJECT_NAME
  FROM #Build AppBuild
    LEFT OUTER JOIN #Job Job
      ON ( AppBuild.Name = Job.DATA_PROJECT_NAME
           AND Job.DATA_PROJECT_NAME NOT LIKE 'BTG -%' )
    LEFT OUTER JOIN #Job Job2
      ON ( Job2.DATA_PROJECT_NAME LIKE 'BTG -%'
           AND Job2.JOB_DESCRIPTION = AppBuild.Name );

The two-join query is also 0% of total cost relative to the first query. Compared against joining on the computed column, the split is about 58%/42%.

(Screenshot: one join vs two joins)


Here is how I created the tables and populated them with test data

DROP TABLE IF EXISTS #Build;
DROP TABLE IF EXISTS #Job;
CREATE TABLE #Build
(
  Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);

CREATE TABLE #Job
(
  JOB_DESCRIPTION   varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
  DATA_PROJECT_NAME varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
  JoinOnMe          AS CASE WHEN DATA_PROJECT_NAME LIKE N'BTG -%' THEN DATA_PROJECT_NAME
                            ELSE JOB_DESCRIPTION END
);

INSERT INTO #Build
( Name )
  SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
    FROM master.dbo.spt_values
      CROSS APPLY master.dbo.spt_values SV2;

INSERT INTO #Job
( JOB_DESCRIPTION, DATA_PROJECT_NAME )
  SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )),
         -- NEWID() is evaluated per row; a bare RAND() is evaluated only once
         -- per query, which would put every row into the same CASE branch
         CASE WHEN ABS(CHECKSUM(NEWID())) % 2 = 1 THEN CAST(ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )) AS nvarchar(20))
              ELSE 'BTG -1' END
    FROM master.dbo.spt_values SV
      CROSS APPLY master.dbo.spt_values SV2;
Dan Oberlam

Sure, any expression that evaluates to a truth can be used in a join:

SELECT * 
FROM 
  person
  INNER JOIN
  country 
  ON
    country.name =
    CASE person.homeCity
    WHEN 'London' THEN 'England'
    WHEN 'Amsterdam' THEN 'Holland'
    ELSE person.homecountry
    END

Suppose homecountry contains records like 'united kingdom', 'great britain', and 'netherlands', but these don't match the country names in our country table. We could use a CASE WHEN to convert them (I've CASE'd on the city name just to demonstrate that the expression doesn't have to have anything to do with the country); for all the others (the ELSE branch), we simply pass the country name through from the person table unchanged.

Ultimately the CASE WHEN will output some string, and this will be matched against the other table's column (though it could equally be matched against another CASE WHEN, etc.)

In your scenario you might find you can avoid all this and just write something using AND and OR, like

a JOIN b 
ON
  (a.projectname like 'abc%' and a.projectname = b.description) OR
  (a.projectname like '%def' and a.whatever = b.othercolumn)

Evaluations in a CASE WHEN are short circuited, evaluating from left to right

Remember: anything that ultimately evaluates to a truth can be used in an ON. Even ON 5<10 is valid (it joins all rows to all other rows, because it's always true)
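Sticking with the person/country tables above, a constant-true predicate degenerates into a cross join:

```sql
-- ON 5 < 10 is true for every pair of rows, so every person row is
-- paired with every country row (equivalent to CROSS JOIN)
SELECT *
FROM person
INNER JOIN country ON 5 < 10;
```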

Caius Jard
  • Evaluations are not short circuited (or even necessarily left to right). [ref1](https://blogs.msdn.microsoft.com/bartd/2011/03/03/dont-depend-on-expression-short-circuiting-in-t-sql-not-even-with-case/) [ref2](https://stackoverflow.com/a/909770/3076272). – Dan Oberlam Nov 06 '19 at 20:26