How can I UPDATE a table based on another table, using values from groups of rows?

Question

I have two tables:

CREATE TABLE Employee (
   Site       ????         ????,
   WorkTypeId char(2)  NOT NULL,
   Emp_NO     int      NOT NULL,
   "Date"     ????     NOT NULL  
);

CREATE TABLE PTO (
   Site       ????         ????,
   WorkTypeId char(2)      NULL,
   Emp_NO     int      NOT NULL,
   "Date"     ????     NOT NULL  
);

I would like to update values in PTO's WorkTypeId column:

Emp_NO in Employee (the lookup table) and PTO should match.
A single WorkTypeId value should be picked from only the first occurrence of the month.

For example, given this sample input data:

TABLE Employee:

Site	WorkTypeId	Emp_NO	Date
5015	MB	1005	2022-02-01
5015	MI	1005	2022-02-04
5015	PO	1005	2022-02-04
5015	ME	2003	2022-01-01
5015	TT	2003	2022-01-10

TABLE PTO:

Site	Emp_NO	Date
5015	1005	2022-02-03
5015	1005	2022-02-14
5014	2003	2022-01-09

Walking through that data:

Given Employee with Emp_NO = 1005...
- ...there are 3 rows for that Emp_NO in the Employee table, with 3 distinct WorkTypeId values spread over 2 distinct Date values.
- So pick the WorkTypeId value for the earliest Date (2022-02-01), which is 'MB'.
- So Emp_NO 1005 gets WorkTypeId = 'MB'.
- And use that single value to fill 1005's WorkTypeId cells in the PTO table.
- But the Employee and PTO rows should also match by month.

So the expected output in the PTO table is

Site	WorkTypeId	Emp_NO	Date
5015	MB	1005	2022-02-03
5015	MB	1005	2022-02-14
5014	ME	2003	2022-01-09

Show us table and view definitions, sample table data and the expected result - all as formatted text (no images.) [mcve] — jarlh, Mar 04 '22 at 15:16
@jarlh I have updated the data into text. Also, the DBMS I am using is SQL Server — Rohan Jaiswal, Mar 04 '22 at 19:40
`Single Work type should be picked from only the first occurrence of the month` What about "Site"? The PTO table contains Site# as well. Do you want the 1st occurrence of the month per "Emp_No" **and** "Site" or just Emp_No only? — SOS, Mar 04 '22 at 19:55
When you say "Update values", do you mean you want to run an `UPDATE` DML statement to modify data on-disk, or do you mean you just want to transform/mutate the query's return values inside a (inherently read-only) `SELECT` statement? — Dai, Mar 05 '22 at 05:31
@RohanJaiswal I've reworded and reformatted your question for clarity. Please improve upon my changes if I've misunderstood you. Also, please complete the `CREATE TABLE` statements by replacing the `????` placeholders with the _actual_ column types. **It's important that we know exactly what column-type the `Date` columns have**. — Dai, Mar 05 '22 at 05:54
If `Employee` really is the name of the table, then it needs a better name because it doesn't actually contain Employee records (i.e. where `Emp_No` is the `PRIMARY KEY`). — Dai, Mar 05 '22 at 05:56
@RohanJaiswal In your post you said that values should also match by month, but that adds ambiguity. Please update your example data to show how data for the same `Emp_No` for multiple months should work. — Dai, Mar 05 '22 at 05:57
@RohanJaiswal What should happen if an `Emp_No` row in `PTO.WorkTypeId` _already has_ a non-`NULL` value? Should it overwrite it, preserve it, or be added as a new separate row? — Dai, Mar 05 '22 at 05:58

SOS · Answer 1 · 2022-03-06T03:50:03.583

Update 2022-03-05

Leaving this here for posterity, but I'd recommend reading Dai's excellent write up on different approaches to this problem.

Try a CROSS APPLY to grab the first Employee record with a matching month and year.

Note: Use OUTER APPLY to always return all PTO records, even when no matching WorkTypeId was found.

SELECT p.Site
       , e.WorkTypeId
       , p.Emp_No
       , p.[Date]
FROM  PTO p CROSS APPLY 
        (
          SELECT TOP 1 WorkTypeId
          FROM   Employee e 
          WHERE  e.Emp_No = p.Emp_No
          AND    MONTH(e.[Date]) = MONTH(p.[Date])
          AND    YEAR(e.[Date]) = YEAR(p.[Date])
          ORDER BY e.[Date] ASC
        ) e

Results:

Site | WorkTypeId | Emp_No | Date      
---: | :--------- | -----: | :---------
5015 | MB         |   1005 | 2022-02-03
5015 | MB         |   1005 | 2022-02-14
5014 | ME         |   2003 | 2022-01-09

db<>fiddle here
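Since the question asks for an UPDATE rather than a SELECT, the same APPLY can be moved into an `UPDATE ... FROM` statement. This is a minimal sketch rather than part of the answer as originally posted: the `f` alias and the `WHERE p.WorkTypeId IS NULL` filter (which leaves already-populated rows alone) are assumptions.

```sql
-- Fill PTO.WorkTypeId from the earliest Employee row in the same
-- month and year for the same Emp_NO.
UPDATE p
SET    p.WorkTypeId = f.WorkTypeId
FROM   PTO AS p
       CROSS APPLY
       (
           SELECT TOP (1) e.WorkTypeId
           FROM   Employee AS e
           WHERE  e.Emp_NO = p.Emp_NO
           AND    YEAR(e.[Date])  = YEAR(p.[Date])
           AND    MONTH(e.[Date]) = MONTH(p.[Date])
           ORDER BY e.[Date] ASC
       ) AS f
WHERE  p.WorkTypeId IS NULL;  -- assumption: only fill empty cells
```

CROSS APPLY only touches PTO rows that have a matching Employee row in the same month. Swapping in OUTER APPLY would also visit the unmatched PTO rows and assign them NULL.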

edited Mar 06 '22 at 03:50

answered Mar 05 '22 at 05:28

SOS


How is an `OUTER APPLY` better than any other `JOIN` type in this situation? As this is evaluated per-row, I think the execution-plan would be excessively expensive... can you show us the plan you get? – Dai Mar 05 '22 at 05:32
@Dai - Yeah, I agree the plan is not ideal. The reason for using APPLY was that it seemed like they only needed the first matching row for an employee for the same month/year. Since there's no start/end date range in either table, I couldn't think of a reasonable JOIN to use. Only subqueries or APPLY. If you can think of a better option, I'm all ears. Always up for learning something new :-) – SOS Mar 05 '22 at 05:38
I just posted my solution now, if you're interested: instead of `APPLY`, it uses a `GROUP BY year+month` query with a self-JOIN to get the first values for each employee's month range - as well as a simpler query that uses `FIRST_VALUE`: https://stackoverflow.com/a/71360743/159145 – Dai Mar 05 '22 at 08:46
Out of curiosity, I ran your version (though I changed it to an `UPDATE p FROM PTO CROSS APPLY( .. )` statement) and to my surprise the execution-plan for that was a _lot_ simpler than any of my queries - which surprised me because I thought that an `APPLY`-type `JOIN` to a `SELECT TOP` derived-table would always result in a RBAR-query, instead it actually was a _single_ table-scan (and sort) - so I'm happy to learn something new today _upvoted!_ – Dai Mar 05 '22 at 09:23

Dai · Accepted Answer · 2022-03-05T09:30:15.167

Getting a value from a column other than the one used in a MIN/MAX expression in a GROUP BY query remains a surprisingly difficult thing to do in SQL. Modern versions of the SQL language (and SQL Server) make it easier, but the techniques are non-obvious and counter-intuitive to most people: despite the conceptually simple nature of the query, they necessarily involve more advanced topics like CTEs, derived-tables (aka inner-queries), self-joins and windowing-functions.

Anyway, as ever in modern SQL, there are usually 3 or 4 different ways to accomplish the same task, with a few gotchas.

Preface:

- As Site, Date, Year, and Month are all keywords in T-SQL, I've escaped them with double-quotes, which is the ISO/ANSI SQL Standards compliant way to escape reserved words.
  - SQL Server supports this by default. If (for some ungodly reason) you have SET QUOTED_IDENTIFIER OFF then change the double-quotes to square-brackets: []
- I assume that the Site column in both tables is just a plain ol' data column, as such:
  - It is not a PRIMARY KEY member column.
  - It should not be used in a GROUP BY.
  - It should not be used in a JOIN predicate.

All of the approaches below assume this database state:

CREATE TABLE "Employee" (
    "Site"     int      NOT NULL,
    WorkTypeId char(2)  NOT NULL,
    Emp_NO     int      NOT NULL,
    "Date"     date     NOT NULL  
);

CREATE TABLE "PTO" (
    "Site"     int      NOT NULL,
    WorkTypeId char(2)      NULL,
    Emp_NO     int      NOT NULL,
    "Date"     date     NOT NULL  
);

GO

INSERT INTO "Employee" ( "Site", WorkTypeId, Emp_NO, "Date" )
VALUES
( 5015, 'MB', 1005, '2022-02-01' ),
( 5015, 'MI', 1005, '2022-02-04' ),
( 5015, 'PO', 1005, '2022-02-04' ),
( 5015, 'ME', 2003, '2022-01-01' ),
( 5015, 'TT', 2003, '2022-01-10' );

INSERT INTO "PTO" ( "Site", WorkTypeId, Emp_NO, "Date" )
VALUES
( 5015, NULL, 1005, '2022-02-03' ),
( 5015, NULL, 1005, '2022-02-14' ),
( 5014, NULL, 2003, '2022-01-09' );

Approach 1 defines CTEs e and p that extend Employee and PTO respectively to add computed "Year" and "Month" columns (Approach 2 does the same for Employee only), which avoids having to repeat YEAR( "Date" ) and MONTH( "Date" ) in GROUP BY and JOIN expressions.
- I suggest you add those as computed-columns in your base tables, if you're able, as they'll be useful generally anyway. Don't forget to index them appropriately too.
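
As a rough sketch of that suggestion, the computed columns and supporting indexes could look something like this. The index names and the choice of key and included columns are assumptions for illustration; pick them to suit your own workload.

```sql
-- Persisted computed columns: YEAR/MONTH are deterministic, so the
-- values can be stored with the row and indexed.
ALTER TABLE "Employee" ADD
    "Year"  AS YEAR( "Date" )  PERSISTED,
    "Month" AS MONTH( "Date" ) PERSISTED;

ALTER TABLE "PTO" ADD
    "Year"  AS YEAR( "Date" )  PERSISTED,
    "Month" AS MONTH( "Date" ) PERSISTED;

GO

-- Hypothetical index names. The Employee index covers the
-- "earliest row per employee per month" lookup.
CREATE INDEX IX_Employee_EmpNo_Year_Month
    ON "Employee" ( Emp_NO, "Year", "Month", "Date" )
    INCLUDE ( WorkTypeId );

CREATE INDEX IX_PTO_EmpNo_Year_Month
    ON "PTO" ( Emp_NO, "Year", "Month" );
```

The `GO` matters here: the new columns have to exist before a later batch can reference them. With the columns in place, the queries can read "Year" and "Month" straight from the base tables instead of computing them in CTEs.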

Approach 1: Composed CTEs with elementary aggregates, then `UPDATE`:

WITH
-- Step 1: Extend both the `Employee` and `PTO` tables with YEAR and MONTH columns (this simplifies things later on):
e AS (
    SELECT
        Emp_No,
        "Site",
        WorkTypeId,
        "Date",

        YEAR( "Date" ) AS "Year",
        MONTH( "Date" ) AS "Month"
    FROM
        Employee
),
p AS (
    SELECT
        Emp_No,
        "Site",
        WorkTypeId,
        "Date",

        YEAR( "Date" ) AS "Year",
        MONTH( "Date" ) AS "Month"
    FROM
        PTO
),
-- Step 2: Get the MIN( "Date" ) value for each group:
minDatesForEachEmployeeMonthYearGroup AS (
    SELECT
        e.Emp_No,
        e."Year",
        e."Month",

        MIN( "Date" ) AS "FirstDate"
    FROM
        e
    GROUP BY
        e.Emp_No,
        e."Year",
        e."Month"
),
-- Step 3: INNER JOIN back on `e` to get the first WorkTypeId in each group:
firstWorkTypeIdForEachEmployeeMonthYearGroup AS (
    /* WARNING: This query returns more than one row per group if multiple rows (for the same Emp_NO, Year and Month) share the same earliest "Date" value, in which case the UPDATE below picks one of them arbitrarily. This can be papered-over with GROUP BY and MIN, but I don't think that's a good idea at all. */
    SELECT
        e.Emp_No,
        e."Year",
        e."Month",

        e.WorkTypeId AS FirstWorkTypeId
    FROM
        e
        INNER JOIN minDatesForEachEmployeeMonthYearGroup AS q ON
            e.Emp_NO = q.Emp_NO
            AND
            e."Date" = q.FirstDate
)
-- Step 4: Do the UPDATE.
-- *Yes*, you can UPDATE a CTE (provided the CTE is "simple" and has a 1:1 mapping back to source rows on-disk).
UPDATE
    p
SET
    p.WorkTypeId = f.FirstWorkTypeId
FROM
    p
    INNER JOIN firstWorkTypeIdForEachEmployeeMonthYearGroup AS f ON
        p.Emp_No = f.Emp_No
        AND
        p."Year" = f."Year"
        AND
        p."Month" = f."Month"
WHERE
    p.WorkTypeId IS NULL;

Here's a screenshot of SSMS showing the contents of the PTO table before and after the above query runs:

Approach 2: Skip the self-`JOIN` with `FIRST_VALUE`:

This approach gives a shorter, slightly simpler query, but requires SQL Server 2012 or later (and that your database is running in compatibility-level 110 or higher).

Surprisingly, you cannot use FIRST_VALUE in a GROUP BY query, despite its obvious similarities with MIN, but an equivalent query can be built with SELECT DISTINCT:

WITH
-- Step 1: Extend the `Employee` table with YEAR and MONTH columns:
e AS (
    SELECT
        Emp_No,
        "Site",
        WorkTypeId,
        "Date",

        YEAR( "Date" ) AS "Year",
        MONTH( "Date" ) AS "Month"
    FROM
        Employee
),
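-- Step 2: Use FIRST_VALUE to get the first WorkTypeId in each Emp_No + Year + Month group (DISTINCT collapses each group down to a single row):
firstWorkTypeIdForEachEmployeeMonthYearGroup AS (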
    SELECT
        DISTINCT
        e.Emp_No,
        e."Year",
        e."Month",
        FIRST_VALUE( WorkTypeId ) OVER (
            PARTITION BY
                Emp_No,
                e."Year",
                e."Month"
            ORDER BY
                "Date" ASC
        ) AS FirstWorkTypeId
    FROM
        e
)
-- Step 3: UPDATE PTO:
UPDATE
    p
SET
    p.WorkTypeId = f.FirstWorkTypeId
FROM
    PTO AS p
    INNER JOIN firstWorkTypeIdForEachEmployeeMonthYearGroup AS f ON
        p.Emp_No = f.Emp_No
        AND
        YEAR( p."Date" ) = f."Year"
        AND
        MONTH( p."Date" ) = f."Month"
WHERE
    p.WorkTypeId IS NULL;

Doing a SELECT * FROM PTO after this runs gives me the exact same output as Approach 1.
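
A quick way to check the result against the expected output from the question is a plain SELECT. This is only a sketch; the second query and its `StillNull` alias are additions for illustration, to surface PTO rows that had no Employee row in the same month.

```sql
-- Show the table after the UPDATE, in a stable order.
SELECT "Site", WorkTypeId, Emp_NO, "Date"
FROM   PTO
ORDER BY Emp_NO, "Date";

-- With the sample data above this returns:
--   5015 | MB | 1005 | 2022-02-03
--   5015 | MB | 1005 | 2022-02-14
--   5014 | ME | 2003 | 2022-01-09

-- PTO rows without a matching Employee month are skipped by the
-- INNER JOIN and keep their NULL. This counts them (0 for the sample data).
SELECT COUNT(*) AS StillNull
FROM   PTO
WHERE  WorkTypeId IS NULL;
```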

Approach 2b, but made shorter:

Just so @SOS doesn't feel too smug about their SQL being considerably shorter than mine, the Approach 2 SQL above can be compacted down to this:

WITH empYrMoGroups AS (
    SELECT
        DISTINCT
        e.Emp_No,
        YEAR( e."Date" ) AS "Year",
        MONTH( e."Date" ) AS "Month",
        FIRST_VALUE( e.WorkTypeId ) OVER (
            PARTITION BY
                e.Emp_No,
                YEAR( e."Date" ),
                MONTH( e."Date" )
            ORDER BY
                e."Date" ASC
        ) AS FirstWorkTypeId
    FROM
        Employee AS e
)
UPDATE
    p
SET
    p.WorkTypeId = f.FirstWorkTypeId
FROM
    PTO AS p
    INNER JOIN empYrMoGroups AS f ON
        p.Emp_No = f.Emp_No
        AND
        YEAR( p."Date" ) = f."Year"
        AND
        MONTH( p."Date" ) = f."Month"
WHERE
    p.WorkTypeId IS NULL;
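
None of the queries above say which row wins when two Employee rows for the same employee share the earliest date in a month (see the warning in Approach 1's Step 3). If that can happen in your data, one option is to rank the rows with ROW_NUMBER and an explicit tie-breaker. This is a sketch of a different technique, not one of the approaches above, and breaking ties on WorkTypeId is purely an assumption; substitute whatever rule makes sense for you.

```sql
WITH ranked AS (
    SELECT
        e.Emp_NO,
        YEAR( e."Date" )  AS "Year",
        MONTH( e."Date" ) AS "Month",
        e.WorkTypeId,
        ROW_NUMBER() OVER (
            PARTITION BY
                e.Emp_NO,
                YEAR( e."Date" ),
                MONTH( e."Date" )
            ORDER BY
                e."Date" ASC,
                e.WorkTypeId ASC  -- assumed tie-breaker
        ) AS rn
    FROM
        Employee AS e
)
UPDATE
    p
SET
    p.WorkTypeId = r.WorkTypeId
FROM
    PTO AS p
    INNER JOIN ranked AS r ON
        p.Emp_NO = r.Emp_NO
        AND
        YEAR( p."Date" ) = r."Year"
        AND
        MONTH( p."Date" ) = r."Month"
WHERE
    r.rn = 1  -- exactly one row per employee per month
    AND
    p.WorkTypeId IS NULL;
```

Because `rn = 1` matches exactly one row per group, each PTO row joins to at most one candidate and the UPDATE becomes deterministic.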

The execution-plans for Approach 2 and Approach 2b are almost identical, except that Approach 2b has an extra Compute Scalar step for some reason.
The execution plans for Approach 1 and Approach 2 are very different, however, with Approach 1 having more branches than Approach 2 despite their similar semantics.
But my execution-plans won't match yours because they're very context-dependent, especially w.r.t. what indexes and PKs you have, whether there are any other columns involved, etc.
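
To compare the approaches on your own data, SQL Server can report I/O and timing figures and the actual plan for each statement. A minimal sketch, assuming you run it from SSMS (or similar) against a test copy of the data: the UPDATE in the middle is just a stand-in for whichever approach you're measuring, and the transaction is rolled back so that every candidate starts from the same table state.

```sql
SET STATISTICS IO ON;    -- logical/physical reads per table
SET STATISTICS TIME ON;  -- CPU and elapsed time
GO
SET STATISTICS XML ON;   -- actual execution plan, returned as XML
GO

BEGIN TRANSACTION;

    -- Stand-in statement: replace with the approach being measured.
    UPDATE p
    SET    p.WorkTypeId = f.WorkTypeId
    FROM   PTO AS p
           CROSS APPLY
           (
               SELECT TOP (1) e.WorkTypeId
               FROM   Employee AS e
               WHERE  e.Emp_NO = p.Emp_NO
               AND    YEAR( e."Date" )  = YEAR( p."Date" )
               AND    MONTH( e."Date" ) = MONTH( p."Date" )
               ORDER BY e."Date" ASC
           ) AS f
    WHERE  p.WorkTypeId IS NULL;

-- Undo the change so the next candidate sees the same data.
ROLLBACK TRANSACTION;
GO

SET STATISTICS XML OFF;
GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```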

Approach 1's plan looks like this:

Approach 2b's plan looks like this:

@SOS's plan, for comparison, is a lot simpler... and I honestly don't know why, but it does show how good SQL Server's query optimizer is these days:

Very interesting (and thorough)! What version of SQL Server was used for the tests? I tested with SQL Server 2016 and about ~35k rows of random data. What surprised me (though maybe it shouldn't) was that despite the simpler plan for APPLY, when I tested all 4 queries the IO statistics were consistently *better for "Approach 1" than APPLY* (or the other approaches). Approach 1 had consistently lower numbers (CPU/elapsed time/scan count/logical reads). [Approach 1 Plan](https://www.brentozar.com/pastetheplan/?id=H1waXqZZc) and [APPLY plan](https://www.brentozar.com/pastetheplan/?id=BJe145Wb9) — SOS, Mar 06 '22 at 03:39
