
I am currently working on a REST API. The resource that the API returns is expected to be huge: tens of millions of rows in the database. Pagination is a must in order to avoid enormous memory consumption when writing the data to the HTTP response.

How can I ensure data integrity when rows are deleted from or added to the DB between client requests?

For example:

page 1: [ John, Mary, Harry, David, Joe ]
page 2: [ Mike, Don, Alex ]

After the client has requested page 1 and stored it locally (in a file or in memory), but before it asks for page 2, the data changes to:

page 1: [ John, Mary, Harry, David, **Mike** ]
page 2: [ Don, Alex, **Terry** ]

2 Answers


A true RESTful (and therefore server-side stateless) answer would be:

  • ask for the first five records (the last one is "Joe"),
  • then ask for the five records superior[1] to "Joe",
  • and so on.

With this strategy you'll get "Mike" and "Terry" on page #2.

[1] They must have a sort order (alphabetical or other).
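The steps above describe keyset (cursor) pagination. A minimal sketch, using an in-memory SQLite table as a stand-in for the real database (the `users` table, its column, and the page size are assumptions for illustration):

```python
import sqlite3

# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [(n,) for n in ["John", "Mary", "Harry", "David", "Joe",
                                 "Mike", "Don", "Alex"]])

def fetch_page(after=None, size=5):
    """Keyset pagination: return up to `size` names that sort after `after`."""
    if after is None:
        rows = conn.execute(
            "SELECT name FROM users ORDER BY name LIMIT ?", (size,))
    else:
        rows = conn.execute(
            "SELECT name FROM users WHERE name > ? ORDER BY name LIMIT ?",
            (after, size))
    return [r[0] for r in rows]

page1 = fetch_page()                 # first five names, alphabetical
page2 = fetch_page(after=page1[-1])  # the names sorted after the last one seen

# Even if the row behind the cursor value is deleted between the two calls,
# the cursor is just a string: the next page starts at the first name
# greater than it, so nothing is skipped or duplicated by the deletion.
```

Note that the cursor survives deletion of its own row precisely because it is compared as a value, not looked up as a record, which is the point Aurélien makes in the comments.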

Aurélien Bénel
  • "superior" is a confusing word here... You need to be able to ask for the next page (N items) after the record "Joe"... you would get an error saying that "Joe" is no longer in existence, so the reference would be invalid, and the client would then request the 1st page again. – Peter May 05 '15 at 01:01
  • @Peter No, "Joe" is a pure string value here, not an actual item. So, the fact that Joe's item doesn't exist anymore has no effects on it. This strategy is a well known and field proven "recipe" in CouchDB community. – Aurélien Bénel May 05 '15 at 05:21
  •
    I don't understand, how does that help if Garry is added after the first page is requested? – Alexandre Cassagne Nov 18 '19 at 14:56

One solution to this is to return a "temporary" resource representing the query result set, and then allow the client to paginate through that using GETs.

For example:

GET /big-query/all-users
Returns: /query-results/12345 

GET /query-results/12345?page=1
Returns: users 1-20

GET /query-results/12345?page=2
Returns: users 21-40

The obvious issue with this solution is that changes to the actual users won't be reflected in the query result set, so you should make that clear in your API docs. It would also be good to "expire" the result set after a reasonable amount of time, both to (a) prevent it from going stale and (b) allow your server to reclaim the memory it is holding hostage.
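The temporary-resource flow could be sketched as follows (the resource names, the TTL value, and the in-memory store are all assumptions; a real server would likely keep snapshots in a cache or on disk):

```python
import time
import uuid

RESULT_TTL = 300   # seconds before a stored result set expires (assumption)
_result_sets = {}  # result id -> (created_at, frozen rows)

def create_query_result(rows):
    """Handle e.g. GET /big-query/all-users: snapshot the rows and
    return the id of a temporary /query-results/<id> resource."""
    result_id = uuid.uuid4().hex
    _result_sets[result_id] = (time.time(), list(rows))
    return result_id

def get_page(result_id, page, page_size=20):
    """Handle GET /query-results/<id>?page=N: slice the frozen snapshot."""
    created_at, rows = _result_sets[result_id]
    if time.time() - created_at > RESULT_TTL:
        del _result_sets[result_id]  # reap the expired snapshot
        raise KeyError(result_id)    # client must re-run the query
    start = (page - 1) * page_size
    return rows[start:start + page_size]
```

Because every page is cut from the same frozen list, the client sees a consistent view no matter how the underlying table changes, at the cost of the staleness and memory issues noted above.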

The other approach is to re-issue the query on each request and then paginate into the result set to find the right chunk of data to return. That approach is stateless and requires no eviction strategy like the earlier idea, but it does mean the query is re-run every time. The upside is that the results are as fresh as possible on each pagination.
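A sketch of this stateless approach, again with an in-memory SQLite table as a stand-in for the real database (table, columns, and page size are assumptions). It also shows the integrity caveat from the question: when a row is deleted between requests, later rows shift across page boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"user{i:02d}",) for i in range(1, 9)])

def get_page(page, size=5):
    """Stateless OFFSET pagination: the query is re-run on every request,
    so each page reflects the database as it is right now."""
    offset = (page - 1) * size
    rows = conn.execute(
        "SELECT name FROM users ORDER BY id LIMIT ? OFFSET ?", (size, offset))
    return [r[0] for r in rows]
```

If a row from page 1 is deleted after the client fetched page 1, the first row of the old page 2 slides back onto page 1, so the client never sees it. That shifting is exactly the trade-off against the snapshot approach.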

Brian Kelly
  • You are right about the disadvantage of the first solution. The result set is meant to change from time to time, and the timing is not under my API's full control, so there is no good way to determine the best moment to invalidate the "snapshot" result set. Re-running the query should be more appropriate in this context. – seriousmegalor Mar 08 '13 at 15:40