best way to "glue" columns together

Question

I need to combine columns from about 15 tables in one large table. Something the following works. But it takes very long to run while the CPU usage spikes to 100%, which causes concerns. Any suggestions will be highly appreciated.

declare @t1 table (empid int)
declare @t2 table (empid int, phone varchar(50))
declare @t3 table (empid int, license varchar(50))
declare @t4 table (empid int, email varchar(100))

insert into @t1 values (1)
insert into @t1 values (2)
insert into @t1 values (3)
insert into @t2 values (1, '5551234')
insert into @t2 values (2, '5553333')
insert into @t2 values (2, 'ttt2222')
insert into @t3 values (2, 'L4455')
insert into @t3 values (3, 'L7890')
insert into @t4 values (2, 'xxx@abc')

SELECT t1.empid, t2.phone, t3.license, t4.email
FROM
    @t1 t1
    LEFT OUTER JOIN
    (SELECT empid, phone, row_number() over (partition by empid order by phone) as rn 
    FROM @t2) t2 
    ON t2.empid = t1.empid
    FULL OUTER JOIN
    (SELECT empid, license, row_number() over (partition by empid order by license) as rn 
    FROM @t3) t3 
    ON t3.empid=t1.empid and (t2.rn is null or t3.rn = t2.rn)
    FULL OUTER JOIN
    (SELECT empid, email, row_number() over (partition by empid order by email) as rn 
    FROM @t4) t4
    ON t4.empid=t1.empid and t4.rn=coalesce(t2.rn, t3.rn) --image how long this coalesce clause is going to be for the 15th table?
order by t1.empid, t2.rn

Your example implies that the tables are only temporary - i.e. they are table variables. Is this the case in your actual implementation? CPU spiking briefly to 100% should not be a concern with SQL. Prolonged peaks are a different matter. You actually want the spike to use all CPU for as short a time as possible, to process your query as quickly as it can. — Greg the Incredulous, Dec 16 '14 at 23:11
Looks like a job for app code. Stored procs are a bad idea at the best of times. — Bohemian, Dec 16 '14 at 23:17
Can you explain why you're having to do ROW_NUMBER for each table and why you're joining on it in the last table in your sample? Part of the reason it takes a long time to run is having to do all the subqueries and order the data in each one of them. — Ashley Lee, Dec 17 '14 at 03:08
@Bohemian why are stored procs a bad idea? There are a number of valid reasons when they are exactly the right tool to use - e.g. for performance reasons, or when accessing the data for SSRS reports. If you use app code with the same logic as in the stored proc then it's likely you'll suffer the same performance problems, or worse. — Greg the Incredulous, Dec 17 '14 at 04:50
@greg actually, "better performance" and other "benefits" are myths. Read [this answer](http://stackoverflow.com/a/6369030/256196) for more detail in why. — Bohemian, Dec 17 '14 at 05:02
@Bohemian all the points in the link are valid, sometimes. I completely agree that it's a bad idea to hide any business logic in stored procs, but for selecting data there are lots of circumstances where it is the right way to get data. I would recommend benchmarking before making a sweeping statement about performance being a myth in all circumstances. — Greg the Incredulous, Dec 17 '14 at 05:30
@greg performance is a myth: SPs aren't scaleable (you can easily distribute the CPU load over lots of app servers, but there's only one DB server) if they calculate anything, and if they don't they should be converted to queries. If you "need" a SP, it's very likely you actually need a better data design instead. And not all "performance" is system - there's team performance too: hard to maintain means poor team team performance — Bohemian, Dec 17 '14 at 09:55
@Bohemian, usually the bottleneck on the SQL server is disk and network, not CPU. Imagine you have two tables: master and detail. You want to add a new row to master and at the same time a row to detail as well. Something like starting point for many details. You need to INSERT into master, get the auto-generated ID and INSERT into detail with this ID. If these three actions are packed into stored procedure you make only one call across the network and everything is done on the server. You don't transmit the generated ID back to the client. This is an example when SP has better performance. — Vladimir Baranov, Dec 17 '14 at 23:18

score 2 · Accepted Answer · edited Nov 13 '18 at 23:13

Your question is not really clear and it would be helpful if you included expected result. Let me guess what you want...

I'll give more meaningful names to the tables in your example and add few more rows to highlight the problem. In real life these tables would be real tables, of course, not variables, but I'll stick with variables to make this sample script easy to run and try. I'm using SQL Server 2008 for this example.

declare @TMain table (empid int);
declare @TPhones table (empid int, phone varchar(50));
declare @TLicenses table (empid int, license varchar(50));
declare @TEmails table (empid int, email varchar(100));

insert into @TMain values (1);
insert into @TMain values (2);
insert into @TMain values (3);
insert into @TMain values (4);

insert into @TPhones values (1, '5551234');
insert into @TPhones values (2, '5551111');
insert into @TPhones values (2, '5552222');
insert into @TPhones values (2, '5553333');
insert into @TPhones values (2, '5554444');

insert into @TLicenses values (2, 'L4455');
insert into @TLicenses values (3, 'L7890');

insert into @TEmails values (2, 'xxx@abc');
insert into @TEmails values (2, 'yyy@abc');
insert into @TEmails values (2, 'zzz@abc');

Simple variant

There is a fast, efficient and wrong naive approach:

SELECT
    Main.empid
    ,Phones.phone
    ,Licenses.license
    ,Emails.email
FROM
    @TMain AS Main
    LEFT JOIN @TPhones AS Phones ON Phones.empid = Main.empid
    LEFT JOIN @TLicenses AS Licenses ON Licenses.empid = Main.empid
    LEFT JOIN @TEmails AS Emails ON Emails.empid = Main.empid
ORDER BY Main.empid, phone, license, email;

It produces Cartesian product of all rows and duplicates rows. This is the result set of the query above. You can see that empid = 2 returned 12 rows, which is 4 phones multiplied by 3 emails and by 1 license. My guess is that you want to see only 4 rows for empid = 2. In other words, for each empid the result should have minimum possible number of rows (I'll show the correct result set in the end).

empid   phone   license email
1   5551234 NULL    NULL
2   5551111 L4455   xxx@abc
2   5551111 L4455   yyy@abc
2   5551111 L4455   zzz@abc
2   5552222 L4455   xxx@abc
2   5552222 L4455   yyy@abc
2   5552222 L4455   zzz@abc
2   5553333 L4455   xxx@abc
2   5553333 L4455   yyy@abc
2   5553333 L4455   zzz@abc
2   5554444 L4455   xxx@abc
2   5554444 L4455   yyy@abc
2   5554444 L4455   zzz@abc
3   NULL    L7890   NULL
4   NULL    NULL    NULL

Long variant

I'm not sure whether my proposed approach below is more efficient than yours. You'll have to try both and compare performance for your data.

We'll need a table of numbers. SQL, Auxiliary table of numbers http://web.archive.org/web/20150411042510/http://sqlserver2000.databases.aspfaq.com/why-should-i-consider-using-an-auxiliary-numbers-table.html http://dataeducation.com/you-require-a-numbers-table/

Again, in real life you'll have a proper table of numbers, but for this example I'll use the following:

declare @TNumbers table (Number int);
insert into @TNumbers values (1);
insert into @TNumbers values (2);
insert into @TNumbers values (3);
insert into @TNumbers values (4);
insert into @TNumbers values (5);

The main idea behind my approach is to make a helper table that would contain correct number of rows for each empid at first and then use this table to get results efficiently.

We'll start with counting number of phones, licenses, e-mails for each empid:

WITH
CTE_Rows
AS
(
    SELECT Phones.empid, COUNT(*) AS EmpRows
    FROM @TPhones AS Phones
    GROUP BY Phones.empid

    UNION ALL

    SELECT Licenses.empid, COUNT(*) AS EmpRows
    FROM @TLicenses AS Licenses
    GROUP BY Licenses.empid

    UNION ALL

    SELECT Emails.empid, COUNT(*) AS EmpRows
    FROM @TEmails AS Emails
    GROUP BY Emails.empid
)

Then we calculate the maximum number of rows for each empid:

,CTE_MaxRows
AS
(
    SELECT
        CTE_Rows.empid
        ,MAX(CTE_Rows.EmpRows) AS MaxEmpRows
    FROM CTE_Rows
    GROUP BY CTE_Rows.empid
)

The CTE above has one row for each empid: empid itself and a maximum number of phones, licenses, e-mails for this empid. Now we need to expand this table and generate the given number of rows for each empid. Here I'm using the Numbers table for it:

,CTE_RowNumbers
AS
(
SELECT
    CTE_MaxRows.empid
    ,Numbers.Number AS rn
FROM
    CTE_MaxRows
    CROSS JOIN @TNumbers AS Numbers
WHERE
    Numbers.Number <= CTE_MaxRows.MaxEmpRows
)

Then we need to add row numbers to all tables with data, which we'll use for joining later:

,CTE_Phones
AS
(
    SELECT
        Phones.empid
        ,ROW_NUMBER() OVER (PARTITION BY Phones.empid ORDER BY phone) AS rn
        ,Phones.phone
    FROM @TPhones AS Phones
)
,CTE_Licenses
AS
(
    SELECT
        Licenses.empid
        ,ROW_NUMBER() OVER (PARTITION BY Licenses.empid ORDER BY license) AS rn
        ,Licenses.license
    FROM @TLicenses AS Licenses
)
,CTE_Emails
AS
(
    SELECT
        Emails.empid
        ,ROW_NUMBER() OVER (PARTITION BY Emails.empid ORDER BY email) AS rn
        ,Emails.email
    FROM @TEmails AS Emails
)

Now we are ready to join all this together. CTE_RowNumbers has exact number of rows that we need, so there is no need for complex FULL JOINs here, simple LEFT JOIN is enough:

,CTE_Data
AS
(
    SELECT
        CTE_RowNumbers.empid
        ,CTE_Phones.phone
        ,CTE_Licenses.license
        ,CTE_Emails.email
    FROM
        CTE_RowNumbers
        LEFT JOIN CTE_Phones ON CTE_Phones.empid = CTE_RowNumbers.empid AND CTE_Phones.rn = CTE_RowNumbers.rn
        LEFT JOIN CTE_Licenses ON CTE_Licenses.empid = CTE_RowNumbers.empid AND CTE_Licenses.rn = CTE_RowNumbers.rn
        LEFT JOIN CTE_Emails ON CTE_Emails.empid = CTE_RowNumbers.empid AND CTE_Emails.rn = CTE_RowNumbers.rn
)

We are almost done. I guess, it is possible that the main table has some empids that don't have any related data (no phones, no liceses, no e-mails), like empid = 4 in my sample data. To get these empids in the result set I'll left join the CTE_Data to the main table:

SELECT
    Main.empid
    ,CTE_Data.phone
    ,CTE_Data.license
    ,CTE_Data.email
FROM
    @TMain AS Main
    LEFT JOIN CTE_Data ON CTE_Data.empid = Main.empid
ORDER BY Main.empid, phone, license, email;

To get the full script just put all code blocks from this post together in the same order as they appear here.

This is the result set:

empid   phone   license email
1   5551234 NULL    NULL
2   5551111 L4455   xxx@abc
2   5552222 NULL    yyy@abc
2   5553333 NULL    zzz@abc
2   5554444 NULL    NULL
3   NULL    L7890   NULL
4   NULL    NULL    NULL

Thanks, Vladimir. It looks like a good idea and I'll certaintly give it a try. — Fei-Xue, Dec 17 '14 at 14:28
@Abe, Ashley, and Spock: The reason I use row_number() is because I want all values to appear only once except for empid. Simple left joins give me Cartesian product instead. Vladimir's final result set is a good example of what I need. Thanks. — Fei-Xue, Dec 17 '14 at 14:34
OK. I've tested the suggested approach. CPU usage is no longer an issue and it's faster. Thanks, Vladimir. I referenced this web site to generate a temp table for row numbers: http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/you-require-a-numbers-table.aspx — Fei-Xue, Dec 17 '14 at 17:46

score -1 · Answer 2 · answered Dec 17 '14 at 07:19

-1

Is there a reason you're using row_number() over

It looks like you want something like this...

SELECT t1.empid, t2.phone, t3.license, t4.email
FROM @t1 t1
LEFT JOIN @t2 t2 ON t1.EmpId = t2.EmpId
LEFT JOIN @t3 t3 ON t1.EmpId = t3.EmpId
LEFT JOIN @t4 t4 ON t1.EmpId = t4.EmpId

answered Dec 17 '14 at 07:19

Spock

4,700
2
16
21

There is so much business logic which you would be missing. – SouravA Dec 17 '14 at 10:30

best way to "glue" columns together

2 Answers2

Simple variant

Long variant

Linked