
I'm trying to combine matching records from a table into a single record of another table. I know this can be done with group by and aggregates like sum() or max(); my difficulty is that the columns that are not part of the group by are varchars that I need to concatenate.

I'm using Sybase ASE 15, so I do not have a function like MySQL's group_concat or similar.

I tried using merge without luck; the target table ended up with the same number of records as the source table.

create table #source_t(account varchar(10), event varchar(10))

Insert into #source_t(account, event) values ('account1','event 1')
Insert into #source_t(account, event) values ('account1','event 2')
Insert into #source_t(account, event) values ('account1','event 3')

create table #target(account varchar(10), event_list varchar(2048))

merge into #target as t
    using #source_t as s
    on t.account = s.account
    when     matched then update set event_list = t.event_list + ' | ' + s.event
    when not matched then insert(account, event_list) values (s.account, s.event)

select * from #target

drop table #target

drop table #source_t

Considering the above tables, I wanted to have one record per account, with all the events of the account concatenated in the second column.

account, event_list
'account1', 'event 1 | event 2 | event 3'

However, all I got was the same records as in #source.

It seems to me that the match in merge is attempted against the "state" of the table at the beginning of statement execution, so the when matched never executes. Is there a way of telling the DBMS to match against the updated target table?

I managed to obtain the results I needed by using a cursor, so that the merge statement is executed n times, n being the number of records in #source; that way the when matched branch actually executes.
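Roughly, the loop looks like this (a simplified sketch, not my exact code; the cursor name and the single-row staging table #one_row are just illustrative):

create table #one_row(account varchar(10), event varchar(10))

declare src_cur cursor for
select account, event from #source_t
for read only

declare @account varchar(10), @event varchar(10)

open src_cur

fetch src_cur into @account, @event

while @@sqlstatus = 0
begin
    -- stage a single source row so each merge sees the rows already inserted by earlier iterations
    delete from #one_row
    insert into #one_row(account, event) values (@account, @event)

    merge into #target as t
        using #one_row as s
        on t.account = s.account
        when     matched then update set event_list = t.event_list + ' | ' + s.event
        when not matched then insert(account, event_list) values (s.account, s.event)

    fetch src_cur into @account, @event
end

close src_cur
deallocate cursor src_cur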

The problem with it is the performance: removing duplicates this way takes about 5 minutes to combine 63K records into 42K.

Is there a faster way of achieving this?

Jahaziel
  • 5 minutes is rather excessive for just 63K rows; while log writes in tempdb should automatically be deferred, I'm wondering if your process is being slowed down by the large number of standalone single-transaction `update` statements; have you tried wrapping the cursor/merge loop in a single 'begin/commit tran' pair? the objective would be to limit the number of log writes which in turn may speed things up; the other consideration is that there's a performance issue with the `merge` statement, so wondering if ... – markp-fuso Jun 29 '19 at 19:26
  • ... you could replace the `merge` with your own if/then/else block that appends events for a common account into a variable, then when the account changes you flush the appended events for the 'last' account to #target (your cursor would need to have an 'order by account...' clause); the objective is to a) see if there's an issue with `merge` and b) reduce the number of writes to #target (in this case you'd be doing 42K inserts and no updates); you could test w/ and w/out a 'begin/commit tran' wrapper (around the cursor loop) to see if that makes any difference – markp-fuso Jun 29 '19 at 19:31
  • It turned out I had overlooked creating the target table's primary key. Once I did that, execution time dropped to about 16 seconds. Adding begin/commit around the loop seems to drop a second or two. – Jahaziel Jul 01 '19 at 15:13
  • ah, yeah, that (adding a PK/index to speed up lookups) can definitely make a diff; I've knocked 2 hrs off a 2.1 hr batch process by adding an index to a 100KB table ... gotta love those 'easy' fixes; anyhoo, good to hear you figured out the performance issue – markp-fuso Jul 01 '19 at 15:37

2 Answers


There's a little-known (poorly documented?) aspect of the UPDATE statement when using it to update a @variable that allows you to accumulate/concatenate values in the @variable as part of a set-based UPDATE operation.

This is easier to 'explain' with an example:

create table source
(account  varchar(10)
,event    varchar(10)
)
go

insert source values ('account1','event 1')
insert source values ('account1','event 2')
insert source values ('account1','event 3')

insert source values ('account2','event 1')

insert source values ('account3','event 1')
insert source values ('account3','event 2')
go

declare @account      varchar(10),
        @event_list   varchar(40)   -- increase the size to your expected max length 

select  @account = 'account1'

-- allow our UPDATE statement to cycle through the events for 'account1',
-- appending each successive event to @event_list

update  source
set     @event_list = @event_list + 
                      case when @event_list is not NULL then ' | ' end + 
                      event
from    source
where   account = @account

-- we'll display as a single-row result set; we could also use a 'print' statement ... 
-- just depends on what format the calling process is looking for

select  @account     as account,
        @event_list  as event_list
go

 account    event_list
 ---------- ----------------------------------------
 account1   event 1 | event 2 | event 3

PRO:

  • single UPDATE statement to process a single account value

CON:

  • still need a cursor to process a series of account values
  • if your desired final output is a single result set then you'll need to store intermediate results (eg, @account and @event_list) in a (temp) table, then run a final SELECT against this (temp) table to produce the desired result set
  • while you're not actually updating the physical table, you may run into problems if you don't have access to 'update' the table

NOTE: You could put the cursor/UPDATE logic in a stored proc, call the proc through a proxy table, and this would allow the output from a series of 'select @account, @event_list' statements to be returned to the calling process as a single result set ... but that's a whole 'nother topic on a (somewhat) convoluted coding method.

For your process you'll need a cursor to loop through your unique set of account values, but you'll be able to eliminate the cursor overhead for looping through the list of events for a given account. Net result is that you should see some improvement in the time it takes to run your process.
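Pulling those pieces together, here's a rough (untested) sketch of the complete run against the sample source table above; the #results table, the acct_cur cursor name and the variable sizes are placeholders you'd adjust for your own data:

create table #results
(account     varchar(10)
,event_list  varchar(2048)
)
go

declare acct_cur cursor for
select distinct account from source
for read only

declare @account      varchar(10),
        @event_list   varchar(2048)

open acct_cur

fetch acct_cur into @account

while @@sqlstatus = 0
begin
    select @event_list = NULL

    -- set-based UPDATE accumulates this account's events into @event_list
    update  source
    set     @event_list = @event_list +
                          case when @event_list is not NULL then ' | ' end +
                          event
    from    source
    where   account = @account

    insert  #results (account, event_list) values (@account, @event_list)

    fetch acct_cur into @account
end

close acct_cur
deallocate cursor acct_cur

-- single result set with one row per account
select account, event_list from #results order by account
go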

markp-fuso
  • Interestingly, you answered what OP needs (i.e., equivalent of MySQL's `GROUP_CONCAT` or Postgres' `STRING_AGG` and more recently SQL Server 2017's `STRING_AGG()` in Sybase): https://stackoverflow.com/a/45092295/1422451 – Parfait Jun 29 '19 at 03:07
  • @Parfait sure, but the UDF solution requires ASE 16 (OP is running ASE 15) and a relatively newer version that fixes some bugs with the earlier implementation of table variables; with some more finagling it *might* be possible to come up with a UDF for this OP/question but it would have to be hardcoded for this particular table (one of the benefits of ASE 16's table variables => ability to build generic UDFs) – markp-fuso Jun 29 '19 at 04:00
  • ... and keep in mind that the ASE 16 / table variable solution requires a fairly convoluted implementation format, namely, passing a copy of the parent query as an argument to the UDF ... ***ugh*** – markp-fuso Jun 29 '19 at 04:07
  • Indeed, I have resolved this kind of situation with a table specific UDF, however in this case, I need to run the query against a production database where I only have "select" privileges, so I can't create an UDF. Also, it is for a one-time report. – Jahaziel Jun 29 '19 at 16:25
  • I Like the idea of looping through unique account values. However, the improvement would be more noticeable if the ratio of merged records vs source records was higher. In my case it is around 66%. If the unique account values was less than half the source records, performance improvement would be much more. – Jahaziel Jun 29 '19 at 16:28
  • @Jahaziel yeah, the stats you gave suggested you may not get much of an improvement of processing at the account level (vs event level); something you may want to consider is to break up your account values in groups and then kick off multiple copies of your process against different groups of account values (ie, parallelize your operations); obviously you don't want to go to such extremes that you hog a noticeable amount of database resources! :-) – markp-fuso Jun 29 '19 at 17:20

After applying the given suggestions, and also speaking with our DBA, the winning idea was to ditch the merge and use conditional logic inside the loop.

Adding begin/commit seemed to reduce execution time by 1.5 to 3 seconds.

Adding a primary key to the target table gave the biggest reduction, bringing execution time down to about 13 seconds.

Converting the merge to conditional logic was the best option in this case, achieving the result in about 8 seconds.

When using conditionals, the primary key on the target table is slightly detrimental (around 1 second), but having it drastically reduces time afterwards, since this table is only a preliminary step for a big join. (That is, the result of this record-merging is later used in a join with 11+ tables.) So I kept the P.K.
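As an illustration, the key can also be created after the load instead of in the create table (the index name here is just an example), which keeps the inserts themselves free of index maintenance while still giving the later join something to seek on:

create unique index ix_target_account on #target(account)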

Since there seems to be no solution without a cursor loop, I used conditionals to merge the values into variables, issuing only inserts to the target table, thus eliminating the need to look up a record to update or to check whether it exists.

Here is a simplified example.

create table #source_t(account varchar(10), event varchar(10));

Insert into #source_t(account, event) values ('account1','event 1');
Insert into #source_t(account, event) values ('account1','event 2');
Insert into #source_t(account, event) values ('account1','event 3');

Insert into #source_t(account, event) values ('account2','came');
Insert into #source_t(account, event) values ('account2','saw');
Insert into #source_t(account, event) values ('account2','conquered');

create table #target(
    account varchar(10), -- make primary key if the result is to be joined afterwards.
    event_list varchar(2048)
);

declare ciclo cursor for
select account, event
from #source_t c
order by account --,...
for read only;

declare @account varchar(10), @event varchar(40), @last_account varchar(10), @event_list varchar(1000)

open ciclo

fetch ciclo into @account, @event

set @last_account = @account, @event_list = null

begin tran

    while @@sqlstatus = 0 BEGIN 

        if @last_account <> @account begin  -- if current record's account is different from previous, insert into table the concatenated event string  
            insert into #target(account, event_list) values (@last_account, @event_list)        
            set @event_list = null -- Empty the string for the next account
        end

        set @last_account = @account -- Copy current account to the variable that holds the previous one
        set @event_list = case when @event_list is null then @event else @event_list + ' | ' + @event end -- Concatenate events with separator; test with 'is null' (a simple 'case @event_list when null' never matches NULL)

        fetch ciclo into @account, @event
    END

    -- after the last fetch, @@sqlstatus changes to <> 0, the values remain in the variables but the loop ends, leaving the last record unprocessed.
    insert into #target(account, event_list) values (@last_account, @event_list)

commit tran

close ciclo

deallocate cursor ciclo;

select * from #target;

drop table #target;

drop table #source_t;

Result:

account |event_list                 |
--------|---------------------------|
account1|event 1 | event 2 | event 3|
account2|saw | came | conquered     |

This code worked fast enough in my real use case. However, it could be further optimized by filtering the source table so it holds only the values needed for the later join. For that, I saved the final joined resultset (minus the join with #target) in another temp table, leaving some columns blank. Then #source_t was filled using only the accounts present in that resultset, processed into #target, and finally #target was used to update the final result. With all that, execution time in the production environment dropped to around 8 seconds (including all steps).
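In outline, that pre-filtering looked something like this; #report stands in for the intermediate resultset table and events_table for the real source of events (both names are illustrative, not the real schema):

-- #report: intermediate resultset already produced by the big join, with event_list left null

-- fill #source_t with only the accounts that actually appear in the report
insert into #source_t(account, event)
select e.account, e.event
from events_table e
where exists (select 1 from #report r where r.account = e.account)

-- ... run the cursor loop above to build #target ...

-- finally, copy the concatenated lists back into the report
update #report
set event_list = t.event_list
from #report, #target t
where #report.account = t.account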

UDF solutions have solved this kind of problem for me before, but in ASE 15 they have to be table-specific and you need to write one function per column. Also, that is only possible in the development environment; it is not an option in production, where I only have read-only privileges.

In conclusion, a cursor loop combined with a merge statement is a simple solution for combining records by concatenating certain values. A primary key or an index on the columns used for the match is required to get acceptable performance.

Conditional logic results in even better performance but comes at the penalty of more complex code (the more you code, the more prone to error).

Jahaziel