
Scenario

We have a pretty standard data import process in which we load a staging table, then MERGE it into a target table.

New requirements (green) involve capturing a subset of the imported data into a separate queue table for completely unrelated processing.

Scenario schema

The "challenge"

(1) The subset consists of only those records that were newly inserted into the target table.

(2) The subset is a projection of some of the inserted columns, but it also includes at least one column that is present only in the source (the staging table).

(3) The MERGE statement already uses the OUTPUT..INTO clause strictly to record the $action values taken by the MERGE, so that we can PIVOT the result and COUNT the number of insertions, updates and deletions for statistics purposes (a rough sketch follows this list). We don't really enjoy buffering the actions for the entire dataset like that and would prefer aggregating the sums on the fly. Needless to say, we don't want to add more data to this OUTPUT table.

(4) We don't want to repeat the matching work that the MERGE performs, even partially, for any reason. The target table is really big, we can't index everything, and the operation is generally quite expensive (minutes, not seconds).

(5) We're not considering roundtripping any output from the MERGE to the client just so that the client can route it to the queue by sending it back immediately. The data has to stay on the server.

(6) We wish to avoid buffering the entire dataset in temporary storage between staging and the queue.
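
For reference, the statistics step mentioned in (3) is currently of roughly this shape (the table variable and column names here are made up for illustration):

-- The MERGE does OUTPUT $action INTO @mergeActions (MergeAction), and afterwards we count:
SELECT
    ISNULL(p.[INSERT], 0) AS inserted,
    ISNULL(p.[UPDATE], 0) AS updated,
    ISNULL(p.[DELETE], 0) AS deleted
FROM (SELECT MergeAction, MergeAction AS act FROM @mergeActions) AS src
PIVOT (COUNT(act) FOR MergeAction IN ([INSERT], [UPDATE], [DELETE])) AS p;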

What would be the best way of going about it?

Failures

(a) The requirement to enqueue only the inserted records prevents us from targeting the queue table directly in an OUTPUT..INTO clause of the MERGE, as it doesn't allow any WHERE clause. We can use some CASE trickery to mark the unwanted records for subsequent deletion from the queue without processing, but this seems crazy.

(b) Because some columns intended for the queue don't appear in the target table, we cannot simply add an insertion trigger on the target table to load the queue. The "data flow split" has to happen sooner.

(c) Since we already use an OUTPUT..INTO clause in the MERGE, we cannot add a second OUTPUT clause and nest the MERGE into an INSERT..SELECT to load the queue either. This is a shame, because it feels like a completely arbitrary limitation for something that works very well otherwise; the SELECT filters only the records with the $action we want (INSERT) and INSERTs them into the queue in a single statement. Thus, the DBMS could theoretically avoid buffering the whole dataset and simply stream it into the queue. (Note: we didn't pursue this, and it's likely that it actually didn't optimize the plan this way.)
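
For the record, this is roughly the shape of statement we attempted, using the foo/bar/baz sample names from the answers below; it works on its own, but not in combination with our existing OUTPUT..INTO for the statistics:

INSERT INTO queue (foo, baz)
SELECT m.foo, m.baz
FROM (
    MERGE INTO target AS t
    USING staging AS s
        ON t.foo = s.foo
    WHEN MATCHED THEN
        UPDATE SET t.bar = s.bar
    WHEN NOT MATCHED THEN
        INSERT (foo, bar) VALUES (s.foo, s.bar)
    OUTPUT $action, inserted.foo, s.baz  -- our existing OUTPUT..INTO cannot be added here as well
) AS m (act, foo, baz)
WHERE m.act = 'INSERT';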

Situation

We feel we've exhausted our options, but decided to turn to the hivemind to be sure. All we can come up with is:

(S1) Create a VIEW of the target table that also contains nullable columns for the data intended for the queue only, and have the view's SELECT statement define them as NULL. Then, set up INSTEAD OF triggers that populate both the target table and the queue appropriately. Finally, wire the MERGE to target the view (a sketch follows S2 below). This works, but we're not fans of the construct -- it definitely looks tricky.

(S2) Give up: buffer the entire dataset in a temporary table via the MERGE's OUTPUT..INTO clause, then immediately copy the data (again!) from the temporary table into the queue.
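
To make S1 concrete, here is a rough sketch of the view/trigger construct, using the foo/bar/baz sample schema from the answers below (view and trigger names are made up; only the INSERT trigger is shown, the other MERGE actions would need their own INSTEAD OF triggers):

CREATE VIEW target_for_import
AS
SELECT foo, bar, CAST(NULL AS varchar(10)) AS baz   -- queue-only column, always NULL in the view
FROM target;
GO

CREATE TRIGGER target_for_import_ioi
ON target_for_import
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO target (foo, bar)
    SELECT foo, bar FROM inserted;

    INSERT INTO queue (foo, baz)
    SELECT foo, baz FROM inserted;   -- baz arrives via the view's extra column
END
GO

-- The MERGE then targets target_for_import and lists baz in its INSERT clause.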

tne
  • I'll give this some thought, but regardless, really really suggest that you not turn to triggers as a solution. – Tom H Dec 30 '15 at 16:30
  • @TomH Thank you for your time; indeed we're hoping so as well. – tne Dec 30 '15 at 16:32
  • Personally if I were trying to solve this problem I'd reconsider requirement number 3. While I agree you probably don't want to have all data duplicated, it's the best way to ensure you caught EXACTLY what was done. Why not just add the Keys for the records to the output, enough to accurately re-fetch data from the source table? – Brad D Dec 30 '15 at 18:58
  • @BradD We're definitely considering (S2), and if we do I think we'd go all-in and capture all the data we need to queue up, not just the key (we'd have to test of course, but I strongly suspect refetching from the target would be far slower than copying just what we need redundantly -- this table is a pain). More importantly, there's an obvious concurrency issue if we refetch, and we can't reasonably wrap these two operations in a single transaction. – tne Dec 30 '15 at 19:04
  • I don't think I would re-fetch from Target, as much as the source. Source under most normal circumstances is going to be of equal or lesser size than target after Merge. I agree you don't want to look through a mountain for a small subset. I would test doing the Merge with the Output inclusive of Keys, then do an Insert into queue with a Select from source Join output table on keys. If it follows best practice on merge technique you should have the proper indexes in place already to make that a speedy operation. – Brad D Dec 30 '15 at 19:17
  • @BradD Thanks for reiterating! For some reason I read "target" even though you were very clear about fetching from the **source** the first time. You're definitely right, we're paying a very similar cost for the statistics anyway and with the right index a `JOIN` should be just fine (and no concurrency issue, staging doesn't move). I suppose because I don't like the way the statistics are gathered I see this as "making worse use of an existing hack" and that is the only reason I was a little dismissive of the idea in req#3, and still am to some extent. Could be the best compromise though. – tne Dec 30 '15 at 19:33
  • @tne - I wouldn't look at it as a hack anymore if there is opportunity to make it more purposeful. In this case re-using the current output is now extensible, which is a good thing. You will be guaranteed to have the exact/consistent working set for anything else that may come up. – Brad D Dec 30 '15 at 19:43
  • @BradD Well, the way I see it, the statistics *could* be aggregated on the fly with three 64-bit buffers to be generous; instead we buffer millions of small and highly redundant records just to reduce them at the end. Similarly, the data to be queued *could* be streamed right into the queue; instead we'd buffer millions of keys just to make a join&copy at the end. So yes, "hack" is relative and I'll definitely change my perspective if it turns out to be the best thing we can do, but in the grand scheme of things, objectively, it's not ideal I don't think. – tne Dec 30 '15 at 19:52
  • Your failure (a) implies that it would be possible to have two `output into` anyway. This isn't the case. – Martin Smith Jan 01 '16 at 22:47
  • @MartinSmith Yes indeed, I could remove that from the post but just in case somebody thinks of something we didn't regarding the stat-gathering requirement, I believe it becomes relevant. – tne Jan 01 '16 at 22:57
  • Regarding point 5 you could just use the `OUTPUT` clause inside a CLR object and process the result in a streaming fashion on the server. No need to send it across a network. But I'd just chuck it all into a temp table except if this was a proven bottleneck. – Martin Smith Jan 01 '16 at 23:00
  • @MartinSmith I like the idea of using a CLR object, it never actually occurred to us that it could give us the extra flexibility. Come to think of it now, even a cursor could suffice. We'd have to do the statistics gathering there too and remove the initial `OUTPUT..INTO` clause, but we didn't like it very much to begin with so that would seem to be a win. Only thing is, it conflicts with the principle of favoring set-based approaches (like other solutions), though it's becoming clear there are no builtin constructs for the behavior we'd like, so that might be justified. Or go with "S2", yes. – tne Jan 02 '16 at 01:51
  • The only way of doing what you want to do would be if SQL Server allowed you to hand craft an execution plan (it doesn't). The operations allowed on the `OUTPUT` stream are extremely limited so there's no way of expressing it in SQL. – Martin Smith Jan 02 '16 at 11:41
  • @MartinSmith, doesn't `OPTION (USE PLAN N'xml_plan')` allow it? – i-one Jan 02 '16 at 13:05
  • @i-one no. That still has to result in a plan that the optimiser can come up with on its own and validate is correct for the given SQL. – Martin Smith Jan 02 '16 at 13:07
  • @tne: In each execution, what are the likely number of inserts and updates? What are these as a %-age of the target table? ANY answer to this question will be guesswork without knowing these numbers. Also, have you looked at using OUTPUT (not INTO) and wrapping the whole MERGE as a nested statement? That allows you to specify WHERE (and PIVOT!) logic on it. Finally, note that MERGE often performs much more poorly than independent INSERTs and UPDATEs as it is much harder (impossible?) to get it to use simple logging. – dsz Jan 02 '16 at 23:32
  • @dsz You're right; it varies a lot from batch to batch though. Target is >70mil, grows and has to scale. The process is really a sync process (hence the merge) and some organizations send differential batches (insertions or updates) while some others send full snapshots (lot of unmatched records). Sizes differ, some only hold a couple hundred thousand records, some hold millions by themselves. 50+ orgs. Import frequency is sometimes weekly, sometimes daily, sometimes even more frequent than that (depending on the org). – tne Jan 03 '16 at 14:08
  • @dsz Yes, we've looked at `OUTPUT` and attempted what you describe in (c). Note that there are two "output streams" (stats and inserted-queue) so to do what you say we'd have to break (6); which it's becoming clear we'll probably do anyway (it was probably naive of us to think we could avoid it, but I personally found the question interesting). Vladimir Baranov has a rather creative solution below, though mostly academic as we perceive it. Else we have to do more per-row processing (e.g. CLR-based per Martin Smith comment or trigger-based per S1). – tne Jan 03 '16 at 14:08
  • Could you use SSIS with a Lookup splitting Matched and Non Matched items? The Matched ones could feed into a simple Update SQL Command and the Non Matched could be split using a multicast transformation to send inserts to both of your tables. – Steve Ford Jan 04 '16 at 17:27
  • @SteveFord Yes! This was vaguely mentioned by dsz in the comments of his answer. I suspect this would end up being an equivalent solution to one of those suggested here (likely Y.B.'s), it'd be interesting to compare. While we're actually using SSIS on a distinct host for that part of the process, the result is what is loaded into the staging table, which resides in an Azure SQL Database (which doesn't yet support integration services). Do you know if it's possible to properly manipulate data flows from a different host without local processing? – tne Jan 04 '16 at 18:14
  • @tne `OPENQUERY` and `EXEC AT` on linked server can process data on a remote host returning just the result set. – Y.B. Jan 14 '16 at 13:01
  • @Y.B. Thanks. Indeed, though I was actually wondering if SSIS "data flows" are able to take advantage of those kinds of features efficiently. Thinking about it again, I figure it's probably worthy of a separate question. – tne Jan 14 '16 at 13:26

6 Answers

17

My understanding is that the main obstacle is the limitation of the OUTPUT clause in SQL Server. It allows one OUTPUT ... INTO a table and/or one OUTPUT that returns a result set to the caller.

You want to save the outcome of the MERGE statement in two different ways:

  • all rows that were affected by MERGE for gathering statistics
  • only inserted rows for queue

Simple variant

I would use your S2 solution, at least to start with. It is easy to understand and maintain and should be quite efficient, because the most resource-intensive operation (the MERGE into Target itself) would be performed only once. There is a second variant below and it would be interesting to compare their performance on real data.

So:

  • Use OUTPUT INTO @TempTable in the MERGE
  • Either INSERT all rows from @TempTable into Stats or aggregate before inserting. If all you need is aggregated statistics, it makes sense to aggregate the results of this batch and merge them into the final Stats instead of copying all rows (see the sketch after this list).
  • INSERT into Queue only "inserted" rows from @TempTable.
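
For instance, the aggregation variant of the second bullet could look roughly like this (TestStatsAgg is a hypothetical table holding one row of counters per batch):

INSERT INTO dbo.TestStatsAgg (inserted, updated, deleted)   -- hypothetical aggregated stats table
SELECT
    COUNT_BIG(CASE WHEN MergeAction = 'INSERT' THEN 1 END),
    COUNT_BIG(CASE WHEN MergeAction = 'UPDATE' THEN 1 END),
    COUNT_BIG(CASE WHEN MergeAction = 'DELETE' THEN 1 END)
FROM @TempTable;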

I'll take sample data from the answer by @i-one.

Schema

-- I'll return to commented lines later

CREATE TABLE [dbo].[TestTarget](
    -- [ID] [int] IDENTITY(1,1) NOT NULL,
    [foo] [varchar](10) NULL,
    [bar] [varchar](10) NULL
);

CREATE TABLE [dbo].[TestStaging](
    [foo] [varchar](10) NULL,
    [bar] [varchar](10) NULL,
    [baz] [varchar](10) NULL
);

CREATE TABLE [dbo].[TestStats](
    [MergeAction] [nvarchar](10) NOT NULL
);

CREATE TABLE [dbo].[TestQueue](
    -- [TargetID] [int] NOT NULL,
    [foo] [varchar](10) NULL,
    [baz] [varchar](10) NULL
);

Sample data

TRUNCATE TABLE [dbo].[TestTarget];
TRUNCATE TABLE [dbo].[TestStaging];
TRUNCATE TABLE [dbo].[TestStats];
TRUNCATE TABLE [dbo].[TestQueue];

INSERT INTO [dbo].[TestStaging]
    ([foo]
    ,[bar]
    ,[baz])
VALUES
    ('A', 'AA', 'AAA'),
    ('B', 'BB', 'BBB'),
    ('C', 'CC', 'CCC');

INSERT INTO [dbo].[TestTarget]
    ([foo]
    ,[bar])
VALUES
    ('A', 'A_'),
    ('B', 'B?');

Merge

DECLARE @TempTable TABLE (
    MergeAction nvarchar(10) NOT NULL,
    foo varchar(10) NULL,
    baz varchar(10) NULL);

MERGE INTO TestTarget AS Dst
USING TestStaging AS Src
ON Dst.foo = Src.foo
WHEN MATCHED THEN
UPDATE SET
    Dst.bar = Src.bar
WHEN NOT MATCHED BY TARGET THEN
INSERT (foo, bar)
VALUES (Src.foo, Src.bar)
OUTPUT $action AS MergeAction, inserted.foo, Src.baz
INTO @TempTable(MergeAction, foo, baz)
;

INSERT INTO [dbo].[TestStats] (MergeAction)
SELECT T.MergeAction
FROM @TempTable AS T;

INSERT INTO [dbo].[TestQueue]
    ([foo]
    ,[baz])
SELECT
    T.foo
    ,T.baz
FROM @TempTable AS T
WHERE T.MergeAction = 'INSERT'
;

SELECT * FROM [dbo].[TestTarget];
SELECT * FROM [dbo].[TestStats];
SELECT * FROM [dbo].[TestQueue];

Result

TestTarget
+-----+-----+
| foo | bar |
+-----+-----+
| A   | AA  |
| B   | BB  |
| C   | CC  |
+-----+-----+

TestStats
+-------------+
| MergeAction |
+-------------+
| INSERT      |
| UPDATE      |
| UPDATE      |
+-------------+

TestQueue
+-----+-----+
| foo | baz |
+-----+-----+
| C   | CCC |
+-----+-----+

Second variant

Tested on SQL Server 2014 Express.

The OUTPUT clause can send its result set to a table and to the caller. So, OUTPUT ... INTO can go into Stats directly, and if we wrap the MERGE statement in a stored procedure, we can use INSERT ... EXEC to load the Queue.

If you examine the execution plan you'll see that INSERT ... EXEC creates a temporary table behind the scenes anyway (see also The Hidden Costs of INSERT EXEC by Adam Machanic), so I expect that overall performance would be similar to the first variant, where you create the temporary table explicitly.

One more problem to solve: the Queue table should have only "inserted" rows, not all affected rows. To achieve that you could use a trigger on the Queue table to discard rows other than "inserted". Another possibility is to define a unique index with IGNORE_DUP_KEY = ON and prepare the data in such a way that "non-inserted" rows would violate the unique index and would not be inserted into the table.

So, I'll add an ID IDENTITY column to the Target table and I'll add a TargetID column to the Queue table. (Uncomment them in the script above). Also, I'll add an index to the Queue table:

CREATE UNIQUE NONCLUSTERED INDEX [IX_TargetID] ON [dbo].[TestQueue]
(
    [TargetID] ASC
) WITH (
PAD_INDEX = OFF, 
STATISTICS_NORECOMPUTE = OFF, 
SORT_IN_TEMPDB = OFF, 
IGNORE_DUP_KEY = ON, 
DROP_EXISTING = OFF, 
ONLINE = OFF, 
ALLOW_ROW_LOCKS = ON, 
ALLOW_PAGE_LOCKS = ON)

Important part is UNIQUE and IGNORE_DUP_KEY = ON.

Here is the stored procedure for the MERGE:

CREATE PROCEDURE [dbo].[TestMerge]
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    MERGE INTO dbo.TestTarget AS Dst
    USING dbo.TestStaging AS Src
    ON Dst.foo = Src.foo
    WHEN MATCHED THEN
    UPDATE SET
        Dst.bar = Src.bar
    WHEN NOT MATCHED BY TARGET THEN
    INSERT (foo, bar)
    VALUES (Src.foo, Src.bar)
    OUTPUT $action INTO dbo.TestStats(MergeAction)
    OUTPUT CASE WHEN $action = 'INSERT' THEN inserted.ID ELSE 0 END AS TargetID, 
    inserted.foo,
    Src.baz
    ;

END

Usage

TRUNCATE TABLE [dbo].[TestTarget];
TRUNCATE TABLE [dbo].[TestStaging];
TRUNCATE TABLE [dbo].[TestStats];
TRUNCATE TABLE [dbo].[TestQueue];

-- Make sure that `Queue` has one special row with TargetID=0 in advance.
INSERT INTO [dbo].[TestQueue]
    ([TargetID]
    ,[foo]
    ,[baz])
VALUES
    (0
    ,NULL
    ,NULL);

INSERT INTO [dbo].[TestStaging]
    ([foo]
    ,[bar]
    ,[baz])
VALUES
    ('A', 'AA', 'AAA'),
    ('B', 'BB', 'BBB'),
    ('C', 'CC', 'CCC');

INSERT INTO [dbo].[TestTarget]
    ([foo]
    ,[bar])
VALUES
    ('A', 'A_'),
    ('B', 'B?');

INSERT INTO [dbo].[TestQueue]
EXEC [dbo].[TestMerge];

SELECT * FROM [dbo].[TestTarget];
SELECT * FROM [dbo].[TestStats];
SELECT * FROM [dbo].[TestQueue];

Result

TestTarget
+----+-----+-----+
| ID | foo | bar |
+----+-----+-----+
|  1 | A   | AA  |
|  2 | B   | BB  |
|  3 | C   | CC  |
+----+-----+-----+

TestStats
+-------------+
| MergeAction |
+-------------+
| INSERT      |
| UPDATE      |
| UPDATE      |
+-------------+

TestQueue
+----------+------+------+
| TargetID | foo  | baz  |
+----------+------+------+
|        0 | NULL | NULL |
|        3 | C    | CCC  |
+----------+------+------+

There will be an extra message during INSERT ... EXEC:

Duplicate key was ignored.

if the MERGE updated some rows. This warning message is sent when the unique index discards rows during the INSERT due to IGNORE_DUP_KEY = ON.

A warning message will occur when duplicate key values are inserted into a unique index. Only the rows violating the uniqueness constraint will fail.

Vladimir Baranov
  • Your use of `IGNORE_DUP_KEY` to solve (a) is pretty damn creative, just like your use of `INSERT..EXEC` to work around (c). The necessity for a dummy record and for generating and ignoring warning messages is of course a little obscure, and perhaps for these reasons we're still leaning toward S2/simple-variant in the end (as you also recommend), but your "second variant" definitely answers the question (and particularly its rather academic aspects, let's be honest). Good answer! – tne Jan 02 '16 at 14:47
  • Thank you. You are right, I wrote the second variant just for the sake of it. It's too obscure to put it into production. From the academic point of view the existing limitations of `OUTPUT` in SQL Server can be less strict. If I'm not mistaken, in Postgres it is possible to nest/chain several `RETURNING` clauses using `WITH` (maybe because Postgres materializes CTE). But, Postgres doesn't have `MERGE` (at least didn't have until 9.5). Still, even in Postgres it would work using temporary tables behind the scenes, so I'd prefer to create temporary table myself and have full control over it. – Vladimir Baranov Jan 03 '16 at 04:47
  • I'm not sure if I follow this solution completely, so sorry if this is something obviously wrong, but could you use a `NullIf` instead of the `Case` statement to dump the non-inserted rows entirely? – Kit Z. Fox Jan 05 '16 at 15:08
  • @KitZ.Fox, if you try to insert `NULL` into `TestQueue.TargetID` the whole `INSERT` would fail. If you try to insert 0 value into `TestQueue.TargetID` only rows that have 0 would be skipped due to `IGNORE_DUP_KEY` (and due to having a special row with 0 in the table in advance), while the rest of the rows would be inserted. Play with the scripts from the answer and see for yourself. – Vladimir Baranov Jan 05 '16 at 23:14
  • @VladimirBaranov: IGNORE_DUP_KEY was the first thing I thought of after reading all the OP requirements. Don't analyze anything, just throw everything into the queue table. OP can even bulk insert - that will be faster than `INSERT ... NOT EXISTS(..)` – Alex Yu Jan 08 '16 at 19:56
7

Consider the following two approaches to solve the problem:

  • Merge data into target and output the inserted rows into queue in a single statement, and summarize statistics in a trigger created on target. The batch identifier can be passed to the trigger via a temporary table.
  • Merge data into target and output the inserted rows into queue in a single statement, and summarize statistics immediately after the merge using built-in change tracking capabilities, instead of doing it in the trigger.

Approach 1 (merge data and gather statistics in the trigger):

Sample data setup (indexes and constraints omitted for simplicity):

create table staging (foo varchar(10), bar varchar(10), baz varchar(10));
create table target (foo varchar(10), bar varchar(10));
create table queue (foo varchar(10), baz varchar(10));
create table stats (batchID int, inserted bigint, updated bigint, deleted bigint);

insert into staging values
    ('A', 'AA', 'AAA')
    ,('B', 'BB', 'BBB')
    ,('C', 'CC', 'CCC')
    ;

insert into target values
    ('A', 'A_')
    ,('B', 'B?')
    ,('E', 'EE')
    ;

Trigger for gathering inserted/updated/deleted statistics:

create trigger target_onChange
on target
after delete, update, insert
as
begin
    set nocount on;

    if object_id('tempdb..#targetMergeBatch') is NULL
        return;

    declare @batchID int;
    select @batchID = batchID from #targetMergeBatch;

    merge into stats t
    using (
        select
            batchID = @batchID,
            cntIns = count_big(case when i.foo is not NULL and d.foo is NULL then 1 end),
            cntUpd = count_big(case when i.foo is not NULL and d.foo is not NULL then 1 end),
            cntDel = count_big(case when i.foo is NULL and d.foo is not NULL then 1 end)
        from inserted i
            full join deleted d on d.foo = i.foo
    ) s
    on t.batchID = s.batchID
    when matched then
        update
        set
            t.inserted = t.inserted + s.cntIns,
            t.updated = t.updated + s.cntUpd,
            t.deleted = t.deleted + s.cntDel
    when not matched then
        insert (batchID, inserted, updated, deleted)
        values (s.batchID, s.cntIns, s.cntUpd, cntDel);

end

Merge statements:

declare @batchID int;
set @batchID = 1;-- or select @batchID = batchID from ...;

create table #targetMergeBatch (batchID int);
insert into #targetMergeBatch (batchID) values (@batchID);

insert into queue (foo, baz)
select foo, baz
from
(
    merge into target t
    using staging s
    on t.foo = s.foo
    when matched then
        update
        set t.bar = s.bar
    when not matched then
        insert (foo, bar)
        values (s.foo, s.bar)
    when not matched by source then
        delete
    output $action, inserted.foo, s.baz
) m(act, foo, baz)
where act = 'INSERT'
    ;

drop table #targetMergeBatch

Check the results:

select * from target;
select * from queue;
select * from stats;

Target:

foo        bar
---------- ----------
A          AA
B          BB
C          CC

Queue:

foo        baz
---------- ----------
C          CCC

Stats:

batchID  inserted   updated   deleted
-------- ---------- --------- ---------
1        1          2         1

Approach 2 (gather statistics, using change tracking capabilities):

Sample data setup is the same as in the previous case (just drop everything, including the trigger, and recreate the tables from scratch), except that in this case we need a PK on target to make the sample work:

create table target (foo varchar(10) primary key, bar varchar(10));

Enable change tracking on database:

alter database Test
    set change_tracking = on

Enable change tracking on target table:

alter table target
    enable change_tracking

Merge data and grab statistics immediately after that, filtering by the change context to count only rows affected by merge:

begin transaction;
declare @batchID int, @chVersion bigint, @chContext varbinary(128);
set @batchID = 1;-- or select @batchID = batchID from ...;
SET @chVersion = change_tracking_current_version();
set @chContext = newid();

with change_tracking_context(@chContext)
insert into queue (foo, baz)
select foo, baz
from
(
    merge into target t
    using staging s
    on t.foo = s.foo
    when matched then
        update
        set t.bar = s.bar
    when not matched then
        insert (foo, bar)
        values (s.foo, s.bar)
    when not matched by source then
        delete
    output $action, inserted.foo, s.baz
) m(act, foo, baz)
where act = 'INSERT'
    ;

with ch(foo, op) as (
    select foo, sys_change_operation
    from changetable(changes target, @chVersion) ct
    where sys_change_context = @chContext
)
insert into stats (batchID, inserted, updated, deleted)
select @batchID, [I], [U], [D]
from ch
    pivot(count_big(foo) for op in ([I], [U], [D])) pvt
    ;

commit transaction;

Check the results:

select * from target;
select * from queue;
select * from stats;

They are the same as in the previous sample.

Target:

foo        bar
---------- ----------
A          AA
B          BB
C          CC

Queue:

foo        baz
---------- ----------
C          CCC

Stats:

batchID  inserted   updated   deleted
-------- ---------- --------- ---------
1        1          2         1
i-one
  • Now try that with another unrelated `OUTPUT..INTO` clause. It's obscure. I do agree with the idea though, that's a very attractive solution which we already explored in (c), but it's incompatible with our requirement #3 (as far as we can see). [Relevant discussion](http://stackoverflow.com/q/17165870). – tne Dec 31 '15 at 01:25
  • @tne, got it (after re-reading thoroughly). In this case, in my opinion, go S1 way, or go this way and add trigger on the target to grab inserted/deleted/updated statistics. – i-one Jan 01 '16 at 20:05
  • Thanks for the stat-gathering-with-trigger idea, we actually hadn't considered it. Those statistics are per "import batch" however and the target table has no notion of which batch the records came from. This means we'd have to use a dummy "batch ID" column only used once by the trigger so that we can retrieve the statistics only for the relevant batch after the merge. Trying to solve that issue always leads back to S1 AFAICT (which is really just a workaround for the fact that `MERGE` cannot target a sproc at the end of the day). – tne Jan 01 '16 at 21:25
  • @tne, _"the target table has no notion of which batch the records came from"_ - this matters. In this case, yes it leads to S1. – i-one Jan 02 '16 at 13:26
  • Point of interest: @dsz suggests using `CONTEXT_INFO` to solve that dilemma without falling back to "S1" (and keeping the triggers on the target). – tne Jan 04 '16 at 12:19
  • @tne, it depends on what "batch ID" is. Is it scalar value during the merge statement, or is it taken (or calculated somehow) from staging table rows? Scalar can be passed via `context_info`. However, I usually staying away from `context_info` where possible, because of it is limited in length and can be spoiled by other processes. – i-one Jan 04 '16 at 12:55
  • @tne The other option is that temporary table `#targetMergeBatch` is created before merge, which is filled with "batch ID", then merge happens, trigger fires on target and takes "batch ID" from `#targetMergeBatch` and gathers stats also. Of course trigger should check existence of `#targetMergeBatch` to not crash on its absence. – i-one Jan 04 '16 at 12:56
  • It is indeed a simple scalar value (integer, present in the staging table) that would fit the bill. I also agree that it has its own downsides, as we discuss in the comment thread of the relevant answer. Using a *local* temporary table looks like the proper generalized solution for the `CONTEXT_INFO` approach (it's probably obvious to everyone, but neither global temporary tables nor table variables would be appropriate). Thanks for that! Not sure your last sentence is as generally applicable though; if we have no recovery path then we'd prefer to crash early I reckon. – tne Jan 04 '16 at 13:27
  • @tne, I meant, with trigger relying on temp table existence, if I write statement `delete from target where foo = 'A'` just to delete test row that I just created, the trigger will crash. I don't think it is acceptable, except if target is modified strictly by merge process and not by others. – i-one Jan 04 '16 at 13:41
  • Oh, that! "Of course" (which is what you said; gotcha). – tne Jan 04 '16 at 13:59
  • Could you use this idea, utilizing the lightweight change tracking to capture the stats from the CHANGETABLE? This will of course only work if you can guarantee that only one import will be running at a time and no other processes are updating the target. – Steve Ford Jan 05 '16 at 09:50
  • @SteveFord, Thank you! Definitely, I learned the new thing, I wonder how I was living without knowing about this capability. I found that it is possible to filter out only own changes by specifying `with change_tracking_context` option. – i-one Jan 05 '16 at 16:54
  • @tne, see the second approach added to answer, based on hint by SteveFord. – i-one Jan 05 '16 at 16:58
  • Very nice; thank you both. Again this is quite a creative way to sidestep the merge output clause limitations, and it also offers higher cohesion than trigger-based solutions (all the relevant code is grouped appropriately, making it easier to understand the big picture). – tne Jan 06 '16 at 09:56
5

I suggest extracting the stats by coding three independent AFTER INSERT / DELETE / UPDATE triggers along the lines of:

create trigger dbo.insert_trigger_target
on [dbo].[target]
after insert
as
insert into dbo.[stats] ([action],[count])
select 'insert', count(1)
from inserted;
go

create trigger dbo.update_trigger_target
on [dbo].[target]
after update
as
insert into dbo.[stats] ([action],[count])
select 'update', count(1) from inserted -- or deleted == after / before image, count will be the same
go

create trigger dbo.delete_trigger_target
on [dbo].[target]
after delete
as
insert into dbo.[stats] ([action],[count])
select 'delete', count(1) from deleted
go

If you need more context, put something in CONTEXT_INFO and pluck it out from the triggers.
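
A rough sketch of that hand-shake (the batch ID value and its 4-byte encoding here are just examples):

-- In the import batch, before running the MERGE:
DECLARE @batchID int = 42;                                  -- hypothetical batch identifier
DECLARE @ctx varbinary(128) = CAST(@batchID AS binary(4));
SET CONTEXT_INFO @ctx;

-- ... MERGE runs here, firing the AFTER triggers ...

-- Inside a trigger, recover it:
DECLARE @currentBatchID int = CAST(SUBSTRING(CONTEXT_INFO(), 1, 4) AS int);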

Now, I'm going to assert that the AFTER triggers are not that expensive, but you'll need to test that to be sure.

Having dealt with that, you'll be free to use the OUTPUT clause (NOT OUTPUT INTO) in the MERGE and then use that nested inside a select to subset the data that you want to go into the queue table.

Justification

Because of the need to access columns from both staging and target in order to build the data for queue, this HAS to be done using the OUTPUT option in MERGE, since nothing else has access to "both sides".

Then, if we have hijacked the OUTPUT clause for queue, how can we re-work that functionality? I think the AFTER triggers will work, given the requirements for stats that you have described. Indeed, the stats could be quite complex if required, given the images that are available. I'm asserting that the AFTER triggers are "not that expensive" since the data of both before and after must always be available in order that a transaction can be both COMMITTED OR ROLLED BACK - yes, the data needs to be scanned (even to get the count) but that doesn't seem like too much of a cost.

In my own analysis that scan added about 5% to the execution plan's base cost.

Sound like a solution?

dsz
  • Yes, in fact. @i-one suggested the very same thing in a comment on their post, and we initially dismissed it at the per-batch stats req (which wasn't very explicit in the OP). You add a little nugget on top in passing, though that's the critical part that actually makes the approach valid: the use of `CONTEXT_INFO` (for the batch ID and indeed whether it's part of a relevant import batch or not). It makes the coupling a little long-range but it would definitely work AFAICT. Hadn't thought of that, thanks! A quick exec-plan shows it streams filtered records properly into the queue, too. – tne Jan 04 '16 at 10:07
  • I think the solution is most relevantly compared to the CLR-based or cursor-based approaches of aggregating statistics and streaming insertions; though it seems both of these would generate one `INSERT` stmt per row to queue up, intuitively making them way less efficient (not tested yet!) than this. Because everything is a trade-off, it seems they'd be more maintainable/less obscure however (no need for "long-range" `CONTEXT_INFO`). – tne Jan 04 '16 at 10:08
  • I've used a LOT of SQL-CLR in my time, and it has some great tricks up its sleeves! However, it introduces its own complexities such that I'd opt for the `CONTEXT_INFO` hand-shake any time. I'm also reasonably confident that it will out-perform CLR in this scenario - and I'd love to see the results of any comparison you run! The only other option I can think of would be integration services - but that is its own set of hassles. – dsz Jan 04 '16 at 10:58
  • Your experience is greatly appreciated. We actually use SSIS as part of the larger process, so it isn't so far-fetched for us. It was always our understanding though that it cannot do something you wouldn't be able to do in T-SQL or using CLR APIs since, we think, it is "merely" a high-level ("graphical") language that translates to T-SQL or CLR APIs at the end of the day. Is this false? IOW, is SSIS able to leverage internal APIs for some of its built-in tasks? (It does seem at least plausible if the package runs on the same server, but still a little far-fetched in my limited understanding.) – tne Jan 04 '16 at 12:49
  • RE testing: you gave us quite a bit to think about (as did others) and it pains me to say that it's looking like we're going to have to choose the most predictable solution because of the dreaded time constraints. I don't know if I'll be able to test every variant of every approach in a meaningful context, or it will have to be on my own time in a "synthetic environment" with possibly misleading results (or requiring more extensive analysis work), for the fun of it. Thus, no promises, but if it happens I'll be sure to share here. – tne Jan 04 '16 at 12:49
  • SSIS makes it a breeze to split / clone the data streams and make them usable for multiple purposes, with some reasonable approaches to buffering. I think about SSIS whenever I'm faced with that kind of issue. And STILL never ended up using it. (More by prejudice and a couple of bad encounters) But it's pretty easy to map out a MERGE-style task, and then you could add splitters in the appropriate places. But that's still a big task to then find out it doesn't perform as well as you would have hoped for. – dsz Jan 04 '16 at 13:48
  • What SSIS does allow for is a multi-threaded solution where data readers and writers are talking to each other between threads. That's a LOT of C#, and CLR can't do multi-threading, so that's where we get to SSIS I think. – dsz Jan 04 '16 at 13:50
  • I'm pretty sure CLR code [can be multi-threaded](https://msdn.microsoft.com/en-us/library/ms131047) in MSSQL environments, though it has to be mindful of the platform's restrictions. I thus still hold the belief SSIS can't do anything we cannot do "manually" with T-SQL&CLR, though as you say it makes some basic patterns very easy to setup. Your approach toward it makes a lot of sense to us by the way; we've been bitten by corner cases more times than we care to admit. Any "non-trivial" pattern tends to move to crammed "script tasks" by necessity and we end up wondering why we use SSIS at all. – tne Jan 04 '16 at 14:12
  • For CLR, they say non-preemptive multi-threading. I'd never been sure if that was "code" for fibers - one execution thread being shared between several "logical" threads. – dsz Jan 04 '16 at 15:54
  • Good point, it certainly "smells" like it, though reading it carefully it clearly says that preemptive TS *is* supported. The documentation simply warns that using it inappropriately can "degrade the scalability of the system", undoubtedly because two task schedulers would be used without much cooperation (leading to, I assume, potential resource starving scenarios). Haven't read further and this is drifting a little off-topic so I'll stop here, but I guess we can conclude by "if someone is doing preemptive MT in MSSQL, they know what they're doing anyway". – tne Jan 04 '16 at 16:36
3

Have you considered ditching the merge and just doing an insert where not exists and an update? You could then use the output clause from the insert to populate your queue table.
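
One way this could look, reusing the hypothetical foo/bar/baz sample schema from the other answers (the @newKeys buffer is my own addition). Note that, unlike MERGE, the OUTPUT clause of a plain INSERT can only reference inserted columns, so the staging-only baz column is joined back from staging here:

DECLARE @newKeys TABLE (foo varchar(10) NOT NULL);

BEGIN TRANSACTION;

-- Update the rows that already exist in the target.
UPDATE t
SET    t.bar = s.bar
FROM   target AS t
JOIN   staging AS s ON s.foo = t.foo;

-- Insert the new rows and capture their keys.
INSERT INTO target (foo, bar)
OUTPUT inserted.foo INTO @newKeys (foo)
SELECT s.foo, s.bar
FROM   staging AS s
WHERE  NOT EXISTS (SELECT 1 FROM target AS t WHERE t.foo = s.foo);

-- Load the queue for new rows only, pulling baz back from staging.
INSERT INTO queue (foo, baz)
SELECT s.foo, s.baz
FROM   staging AS s
JOIN   @newKeys AS k ON k.foo = s.foo;

COMMIT TRANSACTION;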

gordy
  • This sidesteps the question a little bit, but we'll definitely think more about it now. Mostly it has to do with (4); doing it in two operations would mean doing the same target scans twice and in my estimation would be "too wasteful" (this is really the reason why `MERGE` exists in the first place). It's still a very good suggestion though, because clearly we have to ask exactly what "too wasteful" means and then measure it to be sure; we might get surprised. Also wondering about the tx size or whether we can safely do both ops separately. Ditching `MERGE` would also help with stat gathering. – tne Jan 01 '16 at 22:40
  • @tne definitely measure, I think merge would only be more efficient if it's actually a table scan on your target - and in that case you may want to consider eliminating the scan – gordy Jan 02 '16 at 00:33
  • It's definitely a table scan, one that we have no clear way of eliminating -- or so we thought. Your comment forced me to consider this more and we can probably use an indexed checksum to replace some of the matching conditions. This can help either way; thanks for the nudge! – tne Jan 02 '16 at 01:58
3

Import through a staging table might be more efficient with sequential rather than set-oriented processing. I would consider rewriting the MERGE into a stored procedure with a cursor scan. Then for each record you can have as many outputs as you like, plus any counts without a PIVOT, at the total cost of one staging table scan.

A stored procedure might also provide opportunities to split the processing into smaller transactions, whereas triggers on bigger data sets might lead to transaction log overflow.
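
A rough sketch of what such a procedure could look like, using the foo/bar/baz sample schema from the other answers (the procedure name and batch ID are made up; transaction chunking and the delete branch are left out for brevity):

CREATE PROCEDURE dbo.ImportWithCursor
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @foo varchar(10), @bar varchar(10), @baz varchar(10), @matched bit;
    DECLARE @ins bigint = 0, @upd bigint = 0;

    -- One pass over staging does the matching; each row then streams
    -- straight into target, queue and the counters.
    DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT s.foo, s.bar, s.baz,
               CASE WHEN t.foo IS NULL THEN 0 ELSE 1 END
        FROM staging AS s
        LEFT JOIN target AS t ON t.foo = s.foo;

    OPEN cur;
    FETCH NEXT FROM cur INTO @foo, @bar, @baz, @matched;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF @matched = 1
        BEGIN
            UPDATE target SET bar = @bar WHERE foo = @foo;
            SET @upd += 1;
        END
        ELSE
        BEGIN
            INSERT INTO target (foo, bar) VALUES (@foo, @bar);
            INSERT INTO queue (foo, baz) VALUES (@foo, @baz);  -- only new rows reach the queue
            SET @ins += 1;
        END
        FETCH NEXT FROM cur INTO @foo, @bar, @baz, @matched;
    END
    CLOSE cur;
    DEALLOCATE cur;

    -- Aggregated counters, no per-row action buffering.
    INSERT INTO stats (batchID, inserted, updated, deleted)
    VALUES (1 /* hypothetical batch ID */, @ins, @upd, 0);
END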

Y.B.
  • We thought of a similar approach capturing the `MERGE` output for such sequential processing by placing it in a nested `SELECT` statement; @MartinSmith even suggested using a CLR object in the OP comments. Your post prompted me to actually test that solution, since it seems to leverage `MERGE` nicely (the matching logic is quite declarative) whilst providing the same benefits. Turns out: `A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement.` Cursors are excluded. Shame, it seems arbitrary. – tne Jan 04 '16 at 11:34
  • Thus, it looks like your solution is actually the only one that works for this approach. It requires reimplementing the `MERGE` logic imperatively, but it is relatively straightforward and should be clean code with obvious behavior to the maintainer (probably more so than any other answer). It'd be interesting to compare it to @dsz's approach in terms of performance. Thanks! – tne Jan 04 '16 at 11:35
  • @tne Cursor scan of `MERGE` output would not have been an efficient solution anyway: we would just overlay set-oriented processing with sequential scan. As for performance that would really depend on number of `JOIN`s involved in this `MERGE`. Stored procedure can also provide opportunities to split processing into smaller transactions whereas `TRIGGER`s on bigger data sets might lead to transaction log overflow. – Y.B. Jan 04 '16 at 12:36
  • I don't believe it would have been *less* efficient though, the idea was mostly to leverage the ~declarative syntax. Transaction size management is an *excellent* point though and one that is always creeping up on the back of our minds when it comes to the scalability of this system (so far so good, but I love to be ready). This way we have very fine-grained control over it. – tne Jan 04 '16 at 12:44
2

Unless I'm missing something, a simple insert command should meet all your requirements.

insert into queue
(foo, baz)
select staging.foo, staging.baz
from staging join target on staging.foo = target.foo
where whatever

This would happen after the merge into target.

For new records only, do this before the merge

insert into queue
(foo, baz)
select staging.foo, staging.baz
from staging left join target on staging.foo = target.foo
where target.foo is null
Dan Bracuk
  • If we do that after the merge, we don't know what the merge inserted (requirement #1). If it isn't too clear: some records might have been updated instead, and we don't want to queue those up. – tne Dec 30 '15 at 17:29
  • Per your revision #2: Doing that before the merge essentially duplicates the matching logic of the merge. As I explained in req #4, the process takes several minutes (millions and millions of records, and it matches data that isn't indexed). It's not just theoretical here, because what you suggest would require wrapping the `INSERT` and the `MERGE` in a single transaction lest we risk races (what if the `target` is updated between the `INSERT` and the `MERGE`?). A single TX holding potentially millions of records for several minutes would blow up past reasonable timeouts. – tne Dec 30 '15 at 18:36