
Possible Duplicate:
How to provide an API client with 1,000,000 database results?

Wondering if the use of cursors is a good way to implement "paging" using PostgreSQL.

The use case is that we have upwards of 100,000 rows that we'd like to make available to our API clients. We thought a good way to make this happen would be to allow the client to request the information in batches (pages). The client could request 100 rows at a time. We would return the 100 rows as well as a cursor, and when the client was ready, they could request the next 100 rows using the cursor we sent them.

However, I'm a little hazy on how cursors work and exactly how and when they should be used:

  • Do the cursors require that a database connection be left open?
  • Do the cursors run inside a transaction, locking resources until they are "closed"?
  • Are there any other "gotchas" that I'm not aware of?
  • Is there another, better way that this situation should be handled?

Thanks so much!

asked by Chris Dutrow
  • What sort of clients are we talking about? In particular, HTTP clients are prone to just "disappear"... – Colin 't Hart Oct 30 '12 at 15:56
  • Yup, HTTP clients, so they can and will disappear. – Chris Dutrow Oct 30 '12 at 15:58
  • Please cross-link questions if you must post more than one. – Craig Ringer Oct 30 '12 at 23:51
  • Not an exact duplicate. The reason I posted two questions is that one addresses the more general question of "How should I get the data to the client?" This one is asking "If I do paging, what's the deal with that?" You guys who closed this question are brilliant developers with tons of answered questions. None of you have very much experience asking questions, though. Brilliant developers are often unfamiliar with the perspective of the novice. If you don't ask a lot of questions, then liberally close well-formatted and articulated questions, you are trolling in your own right. – Chris Dutrow Oct 31 '12 at 15:32
  • utilizing "limit", "order by", and "seek", refer to http://leopard.in.ua/2014/10/11/postgresql-paginattion/ – mission.liao Sep 01 '15 at 06:17
  • Please note that the term "cursor" in connection with REST APIs is a completely different thing to database query cursors. I suspect this question, and many of the answers, are mixing the two. – Jason Jan 10 '16 at 22:12

2 Answers


Cursors are a reasonable choice for paging in smaller intranet applications that work with large data sets, but you need to be prepared to discard them after a timeout. Users like to wander off, go to lunch, go on holiday for two weeks, etc., and leave their applications running. If it's a web-based app, there's even the question of what "running" means and how to tell if the user is still around.

They are not suitable for large-scale applications with high client counts and clients that come and go near-randomly, as in web-based apps or web APIs. I would not recommend using cursors in your application unless you have a fairly small client count and very high request rates ... in which case sending tiny batches of rows will be very inefficient and you should think about allowing range requests etc. instead.

Cursors have several costs. If the cursor is not WITH HOLD you must keep a transaction open. The open transaction can prevent autovacuum from doing its work properly, causing table bloat and other issues. If the cursor is declared WITH HOLD and the transaction isn't held open you have to pay the cost of materializing and storing a potentially large result set - at least, I think that's how hold cursors work. The alternative is just as bad, keeping the transaction implicitly open until the cursor is destroyed and preventing rows from being cleaned up.
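
To make that concrete, here is a minimal sketch of the two flavours at the SQL level (big_table, id, and payload are invented names for illustration):

```sql
-- Non-holdable cursor: only exists inside the transaction, so the
-- transaction (and the connection) must stay open between fetches.
BEGIN;
DECLARE api_page NO SCROLL CURSOR FOR
    SELECT id, payload FROM big_table ORDER BY id;
FETCH 100 FROM api_page;   -- first batch of 100 rows
FETCH 100 FROM api_page;   -- next batch
CLOSE api_page;
COMMIT;

-- Holdable cursor: survives COMMIT, but PostgreSQL materializes the
-- remaining result set when the transaction commits, and the cursor
-- still occupies the session until it is explicitly closed.
BEGIN;
DECLARE api_page_hold CURSOR WITH HOLD FOR
    SELECT id, payload FROM big_table ORDER BY id;
COMMIT;
FETCH 100 FROM api_page_hold;
CLOSE api_page_hold;
```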

Additionally, if you're using cursors you can't hand connections back to a connection pool. You'll need one connection per client. That means more backend resources are used just maintaining session state, and sets a very real upper limit on the number of clients you can handle with a cursor-based approach.

There's also the complexity and overhead of managing a stateful, cursor-based setup as compared to a stateless connection-pooling approach with limit and offset. You need to have your application expire cursors after a timeout or you face potentially unbounded resource use on the server, and you need to keep track of which connections have which cursors for which result sets for which users....

In general, despite the fact that it can be quite inefficient, LIMIT and OFFSET can be the better solution. It is often better still to seek on the primary key ("keyset" paging) rather than using OFFSET, though.
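
Roughly, the difference looks like this (again with invented table and column names, assuming id is an indexed primary key):

```sql
-- OFFSET paging: the server still reads and throws away every skipped
-- row, so deep pages get progressively slower.
SELECT id, payload
FROM   big_table
ORDER  BY id
LIMIT  100 OFFSET 100000;

-- Keyset ("seek") paging: remember the last id the previous page
-- returned and filter on it, so each page is a cheap index range scan.
SELECT id, payload
FROM   big_table
WHERE  id > 12345   -- last id from the previous page
ORDER  BY id
LIMIT  100;
```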

By the way, you were looking at the documentation for cursors in PL/pgSQL. You want normal SQL-level cursors for this job.


Do the cursors require that a database connection be left open?

Yes.

Do the cursors run inside a transaction, locking resources until they are "closed"?

Yes, unless they are declared WITH HOLD, in which case they consume other database resources.

Are there any other "gotchas" that I'm not aware of?

Yes, as the above should explain.

answered by Craig Ringer
  • This topic is interesting. As of today, GraphQL recommends [pagination with a cursor](https://graphql.org/learn/pagination/), but I'm still confused about the performance of implementing it. – Tokenyet Jun 22 '21 at 07:31

For HTTP clients, don't use cursors to implement paging. For scalability, you don't want server resources tied up between requests.

Instead, use LIMIT and OFFSET on your queries; see LIMIT and OFFSET in the Pg docs.

But make sure that the indexing on your tables will support efficient queries of this form.

Design a RESTful API so that the client can invoke the "next_url" (also passed in the response) to get the next set of rows.
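
A minimal sketch of what each request might run behind such an endpoint (the table, columns, and URL parameter values are invented for illustration):

```sql
-- An index on the sort column keeps ORDER BY ... LIMIT/OFFSET cheap
-- (a primary key already has one).
CREATE INDEX big_table_id_idx ON big_table (id);

-- Query behind e.g. GET /items?limit=100&offset=200; the response
-- would also include next_url (/items?limit=100&offset=300) until a
-- page comes back with fewer than 100 rows.
SELECT id, payload
FROM   big_table
ORDER  BY id
LIMIT  100 OFFSET 200;
```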

answered by Colin 't Hart
  • Aren't OFFSET queries very inefficient for very large offsets (like 500,000 or 1,000,000)? – Chris Dutrow Oct 30 '12 at 16:01
  • Why allow the clients to retrieve 100 rows at a time for such huge datasets? I'd say, either the clients retrieve a lot of rows at once (preferably *all* the rows they need), or very small sets of rows for paging on a real screen -- where end users aren't likely to page through 10+ screens to get to the records they want. – Colin 't Hart Oct 30 '12 at 16:03
  • We need to provide the clients with all of the information. I was under the impression that returning 1,000,000 rows via an HTTP connection like that would be problematic, since the HTTP connection would have to be left open for quite a while? – Chris Dutrow Oct 30 '12 at 16:04
  • Do you have other logical ways to partition your data (rather than an arbitrary by number of rows)? – Colin 't Hart Oct 30 '12 at 16:06
  • How wide are the rows? How many MB is that 1M rows? Are you compressing the data returned to the client? What sort of internet connectivity do your clients have? – Colin 't Hart Oct 30 '12 at 16:08
  • I'm not sure how else we could partition it. The client would need all 1,000,000 rows either way, but we could use random numbers. The rows are sometimes small, but other times quite wide; they're really more like "documents". – Chris Dutrow Oct 30 '12 at 16:12
  • How many clients do you have? If it's just a small number, and they would follow through with all subsequent requests, you could use the cursor and close it automatically when there are no more rows. But if you have many different clients this isn't a scalable solution. – Colin 't Hart Oct 30 '12 at 16:29
  • Very few clients, so that could work. Was hoping for something simpler, but it might have to be this. – Chris Dutrow Oct 30 '12 at 16:57