0

Consider we have complex union select from dozens of tables with different structure but similar fields meaning:

SELECT a1.abc as field1,
       a1.bcd as field2,
       a1.date as order_date,
FROM a1_table a1
UNION ALL
SELECT a2.def as field1,
       a2.fff as field2,
       a2.ts  as order_date,

FROM a2_table a2
UNION ALL ...

ORDER BY order_date

Notice also that results in general are sorted by "synthetic" field order_date.

This query gives huge number of rows, and we want to work with pages from this set of rows. Each page is defined by two parameters:

  • page size
  • field2 value of last item from previous page

Most important thing that we can not change the way can page is defined. I.e. it is not possible to use row number of date of last item from previous page: only field2 value is acceptable.

Current algorithm of paging is implemented in quite ugly way:

1) query above is wrapped in additional select with row_number() additional column and then wrapped in stored procedure union_wrapper which returns appropriate table ( field1 ..., field2 character varying),

2) then complex select performed:

 RETURN QUERY
    with tmp as (
      select
        rownum, field1, field2 from union_wrapper()
    )
    SELECT field1, field2
    FROM tmp
    WHERE rownum > (SELECT rownum
                    FROM tmp
                    WHERE field2 = last_field_id
                    LIMIT 1)
    LIMIT page_size

The problem is that we have to build in memory full union-select results in order to later detect row number from which we want to cut new page. This is quite slow and takes unacceptable much time to perform.

Is any way to reconfigure this operations in order to significantly reduce query complexity and increase its speed?

And again: we can not change condition of paging, we can not change structure of the tables. Only way of rows retrieving.

UPD: I also can not use temp tables, because I'm working in read-replica of the database.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Andremoniy
  • 34,031
  • 20
  • 135
  • 241
  • `ORDER BY order_date` <--> `WHERE field2 = last_field_id` I'd expect a highwatermark to be at least the leading part of the ordering/index. And, since your data model is crippled, you could consider a temp table or a marerialised view. – joop Oct 27 '16 at 11:31
  • @joop I can not use temp tables as I'm working in read-replica. Sorry, I didn't mention this in my question, so I will update it – Andremoniy Oct 27 '16 at 11:38
  • If you have to use all those `UNION ALL` anyway, I believe `row_number over()` is not that ugly for pagination – Vao Tsun Oct 28 '16 at 09:00
  • @VaoTsun I do not quite understand your statement, sorry. Could you clarify it please? – Andremoniy Oct 28 '16 at 09:23
  • I meant that most of cost comes when you `union all` all tables. and probably next most costy is subselect?.. So I'm saying that using `row_number` here does not see ugly. - Maybe I misunderstood you? – Vao Tsun Oct 28 '16 at 09:56
  • @VaoTsun I've never said that `row_number` is core of the problem. Core of the problem is that I have to ask DB build full list of rows and only after that cut page from it. – Andremoniy Oct 28 '16 at 10:29

1 Answers1

0

You have successfully maneuvered yourself into a tight spot. The query and its ORDER BY expression contradict your paging requirements.

ORDER BY order_date is not a deterministic sort order (there could be multiple rows with the same order_date) - which you need before you do anything else here. And field2 does not seem to be unique either. You need both: Define a deterministic sort order and a unique indicator for page end / start. Ideally, the indicator matches the sort order. Could be (order_date, field2), which both columns defined NOT NULL, and the combination UNIQUE. Your restriction "only field2 value is acceptable" contradicts your query.

That's all before thinking about how to get best performance ...

There are proven solutions with row values and multi-column indexes for paging:

But drawing from a combination of multiple source tables complicates matters. Optimization depends on the details of your setup.

If you can't get the performance you need, your only remaining alternative is to materialize the query results somehow. Temp table, cursor, materialized view - the best tool depends on details of your setup.

Of course, general performance tuning might help, too.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228