1

I'm using PostgreSQL and I intend to paging. The target table contains 1M+ rows. In principle, this is straight forward

SELECT * FROM myTable ORDER BY orderCol LIMIT <pageSize> OFFSET <offset>;

Now, this is fast when the orderCol is indexed, but an order of magnitude slower when orderCol has no index. Obviously, the dbms is forced to perform a full table scan in the worst case and has to sort the data for each page requested.

[Edit: More specifically, orderCol might change, i.e., is determined at runtime.]

[Edit2: The general assumption that indexing orderCol improves sorting performance seems to be wrong. If I add an index to orderCol, query time increases about 70%.]

One obvious solution would be to create a temporary table as necessary with an appropriate index and fill the table with the appropriate data (…I think). But that duplicates all the data.

Is there a way how one can "retain" a sort order between requests? Or create a temporary index?

Many thanks for your answers in advance.

Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
dsd
  • 551
  • 5
  • 17
  • Just wait until you hit the "OFFSET 100000" problem. All DBs perform high offsets badly, even search DBs like Lucene. It's one of the unsolved problems of computing. – FuzzyChef Jun 05 '13 at 23:48
  • I performed such tests on a table with 1M entries using the primary key as the sort key. It doesn't make much of a difference whether the offset is large or not. – dsd Jun 06 '13 at 06:28
  • _`orderCol` might change_ Do you mean it is a parameter passed to the query? – Clodoaldo Neto Jun 06 '13 at 15:39
  • Yes. The column that is being ordered is a parameter defined by the application. – dsd Jun 07 '13 at 06:04
  • @FuzzyChef: The problem can be circumvented with the [seek method](http://blog.jooq.org/2013/10/26/faster-sql-paging-with-jooq-using-the-seek-method/) – Lukas Eder Oct 27 '13 at 09:05

3 Answers3

1

Ok, here is one solution I came up with.

The problem really is that deterministic row addressing and the relational model are incompatible. What I'm basically trying to do is to tell the database where to look next. But since requests are independent of each other and we cannot make any assumptions about the physical structure of the table, the only way to address a row is using a unique column value.

Hence the following solution:

CREATE TEMPORARY TABLE orderTable( id int, rank int );
CREATE INDEX orderIdx ON orderTable( rank );
INSERT INTO orderTable (
  select id, row_number() over (order by orderCol) as rank 
  from myTable ORDER BY orderCol
);

Now, I can fetch a page as follows:

SELECT myTable.id, orderCol 
FROM myTable JOIN orderTable ON myTable.id=orderTable.id 
WHERE rank >= <lower> AND rank <= <upper>;

This sounds crazy at first glance, but for pages sizes of about 128 it decraesed the query time by about an order of magnitude compared to using myTable with an index (and clustering) on orderCol.

Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
dsd
  • 551
  • 5
  • 17
  • Eek, one table per sort column! :-) What if you add predicates? That would make one table per column and predicate! Have you heard of the [seek method](http://stackoverflow.com/a/19616401/521799)? It allows for pagination in constant time... – Lukas Eder Oct 27 '13 at 09:14
1

You're running into several problems:

Yes, sorting columns that are not indexed is slow

You might really want to index all sortable column configurations, at least those that are sorted very often by your application. There is some interesting insight on that subject written in this blog.

OFFSET is slow

Even when you do have an index, jumping to high page numbers is slow as you will have to traverse the whole index to do the OFFSET counting. Try to see if you can use the "seek method" instead.

The seek method essentially jumps to the first record after the last record from the previous page, e.g.

SELECT * 
FROM myTable 
WHERE orderCol > :lastValueforOrderCol
ORDER BY orderCol
LIMIT <pageSize>;

Now that you're no longer accessing records by offset, but by using a predicate, indexing all eligible orderCols is essential.

Note, this method doesn't allow you to jump to a fixed ordinal position, like OFFSET. It behaves more like Twitter's lazy-loading of "subsequent tweets". This may or may not be desireable.

Note, the "seek method" is also called keyset paging.

Full table scans can be faster than index scans

Since you do not have any predicates, it may indeed be faster to perform a dumb full table scan and perform the sorting in memory, instead of loading all index b-tree nodes (which may be scattered on disk) to skip rows. This observation would probably be reversed, once you add selective predicates.

I'm surprised, though, that PostgreSQL's optimiser wouldn't automatically choose the full table scan.

Community
  • 1
  • 1
Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
0

what prevents you from just indexing this column?

I had a similar problem, but for a 20GB/40M+ row table with lots of "where" conditions. The data was static so I had the DW Server run a daily script which just extracted the relevant data and created a 150k table.

UPDATE

Edit: More specifically, orderCol might change, i.e., is determined at runtime

by that you mean that the values within the order column will change every time someone runs the query (or that the column can be different, column1, colume2, ...)?

look into materialized views. http://wiki.postgresql.org/wiki/Materialized_Views

You could create a view on this query and then run all queries from this view (and drop them every x min/hour/day via a script). much easier to handle than temp tables.

other than that there are some tricks depended on the detailed use case but no out of the box solution

Chris
  • 129
  • 1
  • 2
  • 11
  • Unfortunately, indexing and clustering does not help much. The reason is, unless the column to be ordered contains only unique values, the dbms still needs to sort all data in order to find out the position of the x'th entry in the ordered set. – dsd Jun 07 '13 at 08:05
  • @dsd, the larger issue is that ordered columns may change. PostgreSQL *can* use an index to retrieve a small portion of a table via order and limit but in your case this may mean too many indexes. – Chris Travers Sep 01 '13 at 02:49