13

I need help also about paging and using UNION ALL for multiple tables:

How do i implement an optimized paging when joining multiple tables using UNION ALL and returning only specific number of rows...


declare @startRow int
declare @PageCount int

set @startRow = 0
set @PageCount = 20

set rowcount @PageCount

select Row_Number() OVER(Order by col1) as RowNumber, col1, col2
from
(
    select col1, col2 from table1 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table2 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table3 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table4 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table5 where datetimeCol between (@dateFrom and @dateTo)
) as tmpTable
where RowNumber > @startRow

table 3, 4, & 5 have huge number of row (millions of rows) where table 1 & 2 may only have few thousand rows.

If startRow is "0", I only expect data from Row 1 to 20 (from Table1). I'm getting the correct result but has a high overhead on the remaining table while sql server tries to all all the data and filter it....

the longer the interval of the @dateFrom and @dateTo makes my query significantly slower while trying to retrieve only few rows from the overall result set

Please help how i can implement a simple but better approach with a similar logic. :(

XING
  • 9,608
  • 4
  • 22
  • 38
  • If you need to `Order by col1`, it's hard to do without fetching all `col1`'s from all tables and sort them. Since you seem to want table1 first, then table2 etc, is the `order by` really correct? Is the data arranged that way in the tables? – Joachim Isaksson Jun 06 '13 at 17:17
  • i still have to allow the script to be ordered by col1 or col2 selectively.. – Robert Dominic Angelo Jun 06 '13 at 17:29
  • Will the LIMIT statement do what you want? – Riking Jun 06 '13 at 17:48
  • I believe LIMIT is not for SQL Server, however, I already have that similar functionality by using Row_Number() – Robert Dominic Angelo Jun 06 '13 at 17:57
  • `LIMIT` is synonymous with `TOP`, not `ROW_NUMBER()`. I think the only way you can do this is by persisting the `UNION ALL` subquery, but that means you have to update a view every time an underlying table is updated, etc. – tommy_o Jun 06 '13 at 18:25
  • Err sorry, you can't use my answer. I thought you could use indexed views with `UNION ALL` but not `UNION` but I was wrong. – tommy_o Jun 06 '13 at 18:34
  • 2
    Do all of the tables have an index on `datetimeCol, col1 INCLUDE (col2)`? You say that "I only expect data from Row 1 to 20 (from Table1)." So does this mean that there is some sort of partitioning between the tables based on the value of `col1`? – Martin Smith Oct 27 '13 at 11:37
  • Really that is working? You are using RowNumber in the same statement that it was created? It does not work that way. – paparazzo Jun 26 '16 at 12:27

5 Answers5

4

Consider using OFFSET FETCH clause (works starting with MSSQL 2012):

declare @startRow int
declare @PageCount int

set @startRow = 0
set @PageCount = 20


select col1, col2
from
(
    select col1, col2 from table1 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table2 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table3 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table4 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table5 where datetimeCol between (@dateFrom and @dateTo)
) as tmpTable
order by col1
offset @startRow rows
fetch next @PageCount rows only

I also want to mention here, why this query always takes O(n*log(n)) time.
To execute this query, database needs to:

  1. to union multiple lists into one list - takes O(n) time for each table, where n - total number of rows in your tables;
  2. sort list by col1 - takes O(n*log(n)), where n - is total number of rows
  3. traverse the list in sorted order, skip @startRow rows, take next @PageCount rows.

As you can see, you need to union and sort all data to get the expected result (number 3).

If the performance of this query is still poor and you want to increase in, try to:

  • create clustred index based on col1 in all tables
  • create a non-clusteres index based on col1 in all tables and **inlude all other columns that you want to output in select list**.
1

There maybe an issue with your database design since you have 5 similar tables. But besides this, you could materialize your UNION ALL query into a permanent table or a temp #-table with appropriate indexes on it and finally paginate over materialized data set with ROW_NUMBER() clause.

YuGagarin
  • 341
  • 7
  • 20
0

Instead of applying classic OFFSET based paging (note that SQL Server 2012 now natively supports it), I think your particular use case could greatly profit from a method often referred to as the "seek method" as described in this blog post here. Your query would then look like this.

select top 20 col1, col2
from
(
    select col1, col2 from t1 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t2 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t3 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t4 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t5 where datetimeCol between (@dateFrom and @dateTo)
) as tmpTable
where (col1 > @lastValueForCol1)
   or (col1 = @lastValueForCol1 and col2 > @lastValueForCol2)
order by col1, col2

The @lastValueForCol1 and @lastValueForCol2 values are the respective values of the last record from the previous page. This allows you to fetch the "next" page. If the ORDER BY direction is DESC, simply use < instead. If (col1, col2) is not globally unique across your tmpTable, you may need to add another column to the query and to the WHERE and ORDER BY clause to avoid losing records between pages.

With the above method, you cannot immediately jump to page 3 without having first fetched the previous 40 records. But often, you do not want to jump that far anyway. Instead, you get a much faster query that might be able to fetch data in constant time, depending on your indexing. Plus, your pages remain "stable", no matter if the underlying data changes (e.g. on page 1, while you're on page 4).

Note, the "seek method" is also called keyset paging.

Indexing

While paging with the "seek method" is always faster than when using OFFSET, you should still make sure that (col1, col2) is indexed in each one of your tables!

Community
  • 1
  • 1
Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
  • since 'union all' is used in stead of 'union', there is no way to guarantee the records are unique. If each table has an unique constraint or primary key on col1,col2 than you could tweak this solution by adding the tablenr as third column to make alle records unique. – Peter May 30 '17 at 15:09
  • @Peter: Look at the question. UNION ALL is already mentioned there... Besides, nothing in the question implies 1) that distinctness is a desirable feature here, 2) that data inside of each individual tables is already distinct. So turning this into UNION might just as well be plain wrong. – Lukas Eder May 30 '17 at 15:15
  • @LukasEder Is it somehow possible to limit the inner union queries to just return 20 rows (based on top clause), right now inner queries return all the matching rows (reading more data than needed) and then takes out top 20 from the result set. – Ammar Khan Aug 07 '20 at 11:49
  • @AmmarKhan: Yes, you can. – Lukas Eder Aug 07 '20 at 12:37
  • @LukasEder If I put, let say Top 20 clause in each of the inner queries that will take out 20 matching rows from each and that will make it total 100 rows (still read more data than needed) in the result set and then have the outer query take out 20 from the result set. And when doing pagination outer query will get the rows of next page within the 100 matching rows. Most probably my approach is wrong. Can you please brief a bit how that could be done? – Ammar Khan Aug 07 '20 at 13:45
  • @AmmarKhan: Just ask a new question with some details. It will work better than here, in the comments. – Lukas Eder Aug 07 '20 at 18:40
  • @LukasEder Can you please check this [Keyset Pagination - Filter By Search Term within Multiple Columns](https://stackoverflow.com/questions/63316273/keyset-pagination-filter-by-search-term-within-multiple-columns) – Ammar Khan Aug 08 '20 at 14:04
0
select col1, col2 from table1 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table2 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table3 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table4 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table5 where datetimeCol between (@dateFrom and @dateTo)

is essentially as efficient as a normal table provided there is an index on the sorting key that you use for paging. This usually results in a query plan where all tables are merge concatenated. Merge concatenating is a streaming operation. It's cost is proportional to the number of rows drawn, not the number of rows in the tables.

Paging by row number in SQL Server always works by enumerating rows start to end until the desired window is reached. Whether the rows are drawn from a table or from multiple merged tables does not make a fundamental difference.

So a good chance to make this fast is to create a covering index keyed on col1. Unfortunately, it's not possible to index for between (@dateFrom and @dateTo) at the same time. So you have to try both indexing strategies and pick what works best.

usr
  • 168,620
  • 35
  • 240
  • 369
  • The index columns would be datetimecol asc, col1 asc include col2. That would allow the range filter, the row_number order and include the col2 to complete the query without a lookup. – Wes H Dec 20 '16 at 19:30
-1

Since the tables are ordered in the result set for the paging (Union ALL does not sort), there is no reason to select from all 5 tables. You should change the code to :

  • Query from Table 1. See if you have enough records.
  • If not Query from Table 2, and so on.

Managing the offset count according to the number of records queried each time. This way, you only query the tables you need.

You can even optimize by selecting the number of records in a table according to the filter to know if you need to query any data from it. So if you want records 30-50, and table1 only has 20 matching records in it, you can skip it altogether.

Amir Pelled
  • 591
  • 4
  • 13