Scenario
We have a pretty standard data import process in which we load a
staging
table, then MERGE
it into a target
table.
New requirements (green) involve capturing a subset of the imported data
into a separate queue
table for completely unrelated processing.
The "challenge"
(1) The subset consists of a selection of the records: those that were
newly inserted into the target
table only.
(2) The subset is a projection of some of the inserted columns, but also
at least one column that is only present in the source (the staging
table).
(3) The MERGE
statement already uses the OUTPUT..INTO
clause
strictly to record the $action
s taken by MERGE
, so that we can
PIVOT
the result and COUNT
the number of insertions, updates and
deletions for statistics purposes. We don't really enjoy buffering the
actions for the entire dataset like that and would prefer aggregating
the sums on the fly. Needless to say, we don't want to add more data to
this OUTPUT
table.
(4) We don't want to do the matching work that the MERGE
performs a second time for whatever reason, even partially. The
target
table is really big, we can't index everything, and the
operation is generally quite expensive (minutes, not seconds).
(5) We're not considering roundtripping any output from the MERGE
to
the client just so that the client can route it to the queue
by
sending it back immediately. The data has to stay on the server.
(6) We wish to avoid buffering the entire dataset in temporary storage
between staging
and the queue
.
What would be the best way of going about it?
Failures
(a) The requirement to enqueue only the inserted records prevents us
from targeting the queue
table directly in an OUTPUT..INTO
clause of
the MERGE
, as it doesn't allow any WHERE
clause. We can use some
CASE
trickery to mark the unwanted records for subsequent deletion
from the queue
without processing, but this seems crazy.
(b) Because some columns intended for the queue
don't appear in the
target
table, we cannot simply add an insertion trigger on the target
table to load the queue
. The "data flow split" has to happen sooner.
(c) Since we already use an OUTPUT..INTO
clause in the MERGE
, we
cannot add a second OUTPUT
clause and nest the MERGE
into an
INSERT..SELECT
to load the queue either. This is a shame, because it
feels like a completely arbitrary limitation for something that works
very well otherwise; the SELECT
filters only the records with the
$action
we want (INSERT
) and INSERT
s them in the queue
in a single
statement. Thus, the DBMS can theoretically avoid buffering the whole
dataset and simply stream it into the queue
. (Note: we didn't pursue
and it's likely that it actually didn't optimize the plan this way.)
Situation
We feel we've exhausted our options, but decided to turn to the hivemind to be sure. All we can come up with is:
(S1) Create a VIEW
of the target
table that also contains nullable
columns for the data intended for the queue
only, and have the
SELECT
statement define them as NULL
. Then, setup INSTEAD OF
triggers that populate both the target
table and the queue
appropriately. Finally, wire the MERGE
to target the view. This
works, but we're not fans of the construct -- it definitely
looks tricky.
(S2) Give up, buffer the entire dataset in a temporary table using
another MERGE..OUTPUT
. After the MERGE
, immediately copy the data
(again!) from temporary table into the queue
.