TSQL query to merge data from multiple tables that may or may not have matching rows?

Question

For example, suppose we're conducting research where students can take up to 10 different tests, and each table in the database stores all the students' responses for one test. The tables are named after each test as: T1, T2, ... , T10. Suppose each table has a primary key column 'Username' that identifies each student. Students may or may not have completed each test, so there may or may not be a record in each table for each student.

What is the correct SQL Query to return all the test data from all tables, with one row per student (one row per username)? I want the simplest query possible that returns the correct results. I would also like to coalesce the Username fields into a single Username field in the final query.

To clarify, I understand that SQL has a major limitation in that it does not support a syntax to select all columns except one or more fields like "select *[^ExcludeColumn1][^ExcludeColumn2]". To avoid specifically naming all columns in the final query, it would be acceptable to leave all the Username columns there, as long as it includes a coalesced Username field at the beginning named something like RowID.

As for the overall query, one option would be to perform a union all on the username column of all ten tables, then select the distinct usernames across all tables, then perform a series of left joins against the list of distinct usernames on all 10 tables. That would result in a very straightforward query where each left join is performed on the same distinct set of usernames, but I want to avoid a separate up-front query for distinct usernames. (Although if that's the best option, let me know). It would look something like this:

select * from
(select distinct coalesce(t1.Username,t2.Username,...,t10.Username) as RowID from t1,t2,t3,t4,t5,t6,t7,t8,t9,t10) distinct_usernames
left join t1 on t1.Username =  distinct_usernames.RowID
left join t2 on t2.Username =  distinct_usernames.RowID
...
left join t10 on t10.Username =  distinct_usernames.RowID

Although that is short and easy to write, it is incredibly inefficient and would take hours to run on test tables with 5000+ rows each, so with an adjustment, an equivalent version that runs in a few seconds is:

select * from (
select distinct Username as RowID from (
select Username from t1
union all
select Username from t2
union all
...
select Username from t10
) all_usernames) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID

I think that what I have above might be the most efficient and correct query (takes only a couple seconds to run and returns correct result set), but I also thought perhaps it could be simplified with some kind of full join. The problem is that full joins get confusing with more than two tables, because without pre-determining the usernames, each subsequent table would have to match records against any of the preceding tables, resulting in a query where each additional table has "[previous table count] + 1" conditions on matching the username.

I would use the SQL Database Engine Tuning Adviser and see how efficient you could get it. I really don't see any other way to query the results you want, so you may just have to focus your efforts on optimization. — Narnian, Feb 25 '13 at 19:31
Yeah, I figured it was probably close to being optimal, but as Tim posted, I could eliminate the 'all' and 'distinct' clauses by just using 'union'. My thoughts were that merging multiple (unbalanced) tables on some known set of IDs was a pretty common situation that should have a very straightforward syntax. This kind of operation could use its own syntax in SQL like "select * from merge (t1,t2,t3,t4,t5) on [PrimaryKeyColName]", simply because a full join isn't practical, and the series of left joins is optimal, but only works correctly when joined against a complete list of distinct IDs. — Triynko, Feb 25 '13 at 21:28

score 2 · Accepted Answer · edited May 23 '17 at 12:27

Assuming that Username is unique in each table, your second query would be the way I would try first, with the slight modifications of removing distinct and simply using union (which implies distinct) rather than union all:

select *
from (
        select Username from t1
        union
        select Username from t2
        union
        -- ...
        select Username from t10
    ) distinct_usernames
    left join t1 on t1.Username = distinct_usernames.Username
    left join t2 on t2.Username = distinct_usernames.Username
    -- ...
    left join t10 on t10.Username = distinct_usernames.Username

From there I would make sure that Username is indexed, possibly even using it as the clustered index. I've also had optimization luck in the past by implementing your distinct_usernames as a temp table (possibly indexed, or an indexed view) at the beginning of the proc, but only testing would determine if that were worthwhile.

A full outer join would require a bunch of or conditions or coalesce arguments, though it could be worth a try on just a few tables to see if the performance is there. I can't try to out-guess what your query engine will like best.

Also, getting just the column names that you want could be done with a query to sys.columns or information_schema.columns and using dynamic SQL to build your query as a string and then executing that.

Thanks for pointing out that "union" by itself is distinct by default, so I can leave out 'all' and wrapping 'distinct' query. I think we agree that left joins are probably more straightforward and efficient than anything that could be accomplished with a full join. I already had Username set as the primary key (clustered), since I knew that would speed it up considerably. What happened was my boss gave me 3 Excel files and asked me to import and merge them on username in SQL Server, so I had to base the unique user list on what was in those files, rather than on our actual user database. — Triynko, Feb 25 '13 at 21:12
Gotcha, you were pretty much there, then. So did anything actually improve performance? — Tim Lehner, Feb 25 '13 at 21:25
Yes, but not noticeably since the query takes only about 3 seconds to run in both cases. From looking at the query plan, using "union" by itself causes a single "Merge Join (Union)" to take the place of a "Stream Aggregate (Aggregate) and Merge Join (Concatenation)". Also, another Merge Join switches from Concatenation to Union. See plan here: http://i.imgur.com/UbrOqCw.png for before and after. — Triynko, Feb 25 '13 at 21:59

TSQL query to merge data from multiple tables that may or may not have matching rows?

1 Answers1