16

This question is about the functionality of first_value(): getting it from another function, or via a workaround.

It is also about the "little gain in performance" that matters in big tables. Using e.g. max() in the context explained below demands spurious comparisons; even if fast, it imposes some additional cost.


This typical query

SELECT x, y, count(*) as n 
FROM t 
GROUP BY x, y;

needs to repeat all returned columns in the GROUP BY clause. A bit of syntactic sugar for this is to use positional references:

SELECT x, y, count(*) as n 
FROM t 
GROUP BY x, 2  -- imagine that 2, 3, etc. stand in for the repeated columns

Sometimes we need not only sugar, but also some semantics to handle a complex context:

SELECT x, COALESCE(y,z), count(*) as n 
FROM t 
GROUP BY x, y, z  -- are y and z really needed as grouping clauses?
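One way to sidestep this particular case (a sketch, using the question's own table t): PostgreSQL allows grouping by an expression, so y and z need not appear individually in the GROUP BY clause:

```sql
SELECT x, COALESCE(y, z) AS yz, count(*) AS n
FROM t
GROUP BY x, COALESCE(y, z);  -- group by the expression itself
```

This avoids the spurious per-column grouping, though it still does not generalize to arbitrary "any value will do" columns.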

I can imagine many other complex contexts. Let's see usual solutions:

SELECT x, max(y) as y, count(*) as n 
FROM t 
GROUP BY x  -- best semantics! no need for other columns here

where the max() function stands in for any "sample()" (e.g. first or last value). An aggregate that does nothing should perform better than max(); first_value() has the right semantics, but it is a window function, so it needs a WINDOW, and performance is lost again. There are some old suggestions to implement first/last aggregate functions in C.
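For reference, a SQL-level version of such a first() aggregate can be sketched along the lines of the well-known pattern from the PostgreSQL wiki (slower than a C implementation, but it works on any version supporting polymorphic aggregates):

```sql
-- A "pick any value" aggregate: the state function keeps the first
-- input it sees and ignores the rest (STRICT skips NULL inputs).
CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS 'SELECT $1';

CREATE AGGREGATE first (anyelement) (
  SFUNC = first_agg,
  STYPE = anyelement
);

-- Usage, avoiding both max() comparisons and extra GROUP BY columns:
-- SELECT x, first(y) AS y, count(*) AS n FROM t GROUP BY x;
```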

Is there some "get any one value fast" aggregate function with better performance than max() or GROUP BY x, 2, ...?
Perhaps some new feature in a recent release?

Erwin Brandstetter
Peter Krauss
  • Please [edit] your question and add some sample data and the expected output based on that data. `max()` will be pretty fast if you have an index on the columns. You might want to look into `limit` or `distinct on ()`. Also, if you *do* have slow queries, provide the queries, the full table definition and the execution plan using `explain (analyze, verbose)` –  Mar 24 '16 at 10:17
  • I don't understand what you mean by `max()` function can be any "sample()". Did you mean "aggregate function"? Also, if the question is how to make aggregate functions faster, what has all of the introduction about syntactic sugar got to do with it? – William Robertson Mar 24 '16 at 10:23
  • There is a way to emulate loose index scan on postgres which would be the fastest https://wiki.postgresql.org/wiki/Loose_indexscan – Mihai Mar 24 '16 at 10:24
  • Sorry @a_horse_with_no_name and all the others, I edited; is it better now? – Peter Krauss Mar 24 '16 at 15:42
  • @WilliamRobertson I edited to explain the context; is it explained now? – Peter Krauss Mar 24 '16 at 15:43
  • @Mihai thanks for the good clue and link. Perhaps what `first_value()` needs in this case is a window with a *loose indexscan*... but Postgres does not support it, and I need such a simple query; the CTE strategy seems good for more complex contexts. – Peter Krauss Mar 24 '16 at 15:51
  • [This question](http://dba.stackexchange.com/questions/96364/is-it-possible-to-get-seek-based-parallel-plan-for-distinct-group-by) is about SQL Server, not Postgres, but it seems to be about the similar problem and solutions (lateral join and recursive CTE) should work in Postgres too. The main point is: when you do `GROUP BY` the server reads all rows of a table/index. If the number of distinct values in the grouped column is much less than the number of rows in a table it is better to do index seek for each distinct grouped value. – Vladimir Baranov Mar 24 '16 at 22:57
  • You have a *count* in your query, which changes the nature of the problem. In your question you only ask for a quick sampling method while aggregating. There are fast index-backed alternatives for other aggregates ... But all bets are off, if you need to count all rows anyway. – Erwin Brandstetter Mar 25 '16 at 15:19
  • @ErwinBrandstetter, thanks a lot for polishing my English and commenting with good clues. Do you have a link explaining the context of your "... which changes the nature of the problem"? In my mind I can also replace the `count()` of the above examples with `jsonb_agg(z)` or any other aggregate function. An obvious solution would be to reuse the natural "implicit WINDOW" of the query in the *first_value()* function, but unfortunately PostgreSQL does not offer any such implicit thing -- and I suppose adding any other WINDOW has performance costs. – Peter Krauss Mar 25 '16 at 15:31

2 Answers

8

If you really don't care which member of the set is picked, and if you don't need to compute additional aggregates (like count), there is a fast and simple alternative with DISTINCT ON (x) without ORDER BY:

SELECT DISTINCT ON (x) x, y, z FROM t;

x, y and z are from the same row, but the row is an arbitrary pick from each set of rows with the same x.

If you need a count anyway, your options with regard to performance are limited since the whole table has to be read in either case. Still, you can combine it with window functions in the same SELECT:

SELECT DISTINCT ON (x) x, y, z, count(*) OVER (PARTITION BY x) AS x_count FROM t;

Consider the sequence of events in a SELECT query: window functions are evaluated before DISTINCT ON, so the count per partition is computed before rows are eliminated.

Depending on requirements, there may be faster ways to get counts (estimates, for instance).

In combination with GROUP BY the only realistic option I see to gain some performance is the first_last_agg extension. But don't expect much.
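A usage sketch, assuming the extension is installed and using the table and columns from the question:

```sql
CREATE EXTENSION first_last_agg;

SELECT x, first(y) AS y, count(*) AS n
FROM t
GROUP BY x;  -- no need to list y in GROUP BY
```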

For other use cases without a count (including the simple case at the top), there are faster solutions, depending on your exact use case, in particular to get the "first" or "last" value of each set: emulate a loose index scan (as @Mihai commented).
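The loose-index-scan emulation from the PostgreSQL wiki can be sketched as a recursive CTE (assuming an index on t(x); each step performs an index seek for the next distinct x instead of scanning all rows):

```sql
WITH RECURSIVE cte AS (
   (SELECT x FROM t ORDER BY x LIMIT 1)      -- smallest x via the index
   UNION ALL
   SELECT (SELECT t.x FROM t
           WHERE  t.x > cte.x
           ORDER  BY t.x LIMIT 1)            -- next distinct x, one seek each
   FROM   cte
   WHERE  cte.x IS NOT NULL
)
SELECT x FROM cte WHERE x IS NOT NULL;
```

When the number of distinct x values is much smaller than the row count, this can be dramatically faster than a full-table GROUP BY.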

Erwin Brandstetter
  • Thank you again. I will test [first_last_agg](http://pgxn.org/dist/first_last_agg/); it seems to be what I need (!)... I will return here (in a few days) to comment on it and your discussion. – Peter Krauss Mar 25 '16 at 15:43
  • ... I am [waiting for the first_last review on GitHub](https://github.com/wulczer/first_last_agg/issues/2)... But I did some homework: `DISTINCT ON` is not a direct solution because, as you commented and [I tested](http://dba.stackexchange.com/q/133520/90651), it does not optimize `GROUP BY`, nor does it allow removing columns from the clause. The ideal solution, [cited by Craig here](http://stackoverflow.com/a/8373384/287948), is the `ANY_VALUE()` defined in MySQL 5.7+, which offers the correct semantics for this task (and the SQL parser decides whether to use first or last as the sample). – Peter Krauss Mar 30 '16 at 03:25
  • @PeterKrauss: I added an option to combine aggregates with `DISTINCT ON`. – Erwin Brandstetter Mar 31 '16 at 12:21
  • Hi. The ideal answer would be a solution with an "aggregate function with better performance than `max()`" (question text), which does not exist for PostgreSQL, as you and @rpy asserted... So ideally the bounty would be shared... But the best clue about a workaround is your `first_last_agg`, so you must receive the bounty. All the other discussion was so good (!), with a taste of philosophizing ;-) In my opinion, the philosophical ideal [is MySQL's `ANY_VALUE()`](http://dba.stackexchange.com/a/133747/90651), and I showed there why, unfortunately, `DISTINCT ON` is not a solution for the explained problem. – Peter Krauss Apr 01 '16 at 04:22
5

Not an official source, but some thoughts on a question perceived as rather generic:

In general, aggregators need to process all matching rows. From your question text, you seem to target aggregators that identify specific values (max, min, first, last, n-th, etc.). Those could benefit from data structures that maintain the proper value for such a specific aggregator; "selecting" that value can then be sped up drastically.
E.g. some databases keep track of the max and min values of columns.
You can view this support as highly specialised internal indexes that are maintained by the system itself and are not under (direct) control of the user.

Now PostgreSQL focusses more on support that helps improve queries in general, not just special cases. So the developers avoid adding effort to speed up special cases that do not obviously benefit a broad range of use cases.

Back to speeding up sample value aggregators.

With aggregators having to process all rows in the general case, and with no general strategy for short-circuiting that requirement for aggregators that identify specific values ("sample"-kind aggregators, for now), it is obvious that any reformulation of a query that does not reduce the set of rows to be processed will take a similar time to complete.

For speeding up such queries beyond processing all rows, you will need a supporting data structure. With databases this is usually provided in the form of an index.

You also could benefit from special execution operations that allow reducing the number of rows to be read.

With pg you have the capability of providing your own index implementation. So you could add an implementation that best supports the special kind of aggregator you are interested in (at least for cases where you need to run such queries often).
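Before reaching for a custom index access method, a plain composite b-tree index already gives the planner such a supporting structure (a sketch, assuming the table t(x, y) from the question):

```sql
-- An index on (x, y) allows index-only scans for queries that need
-- one y per x, and powers loose-index-scan style emulations:
CREATE INDEX t_x_y_idx ON t (x, y);
```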

Also, execution operations like index-only scans or lazy evaluation with recursive queries may allow writing a specific query in a way that is faster than the "straight" formulation.

If you are targeting your question at general approaches, you might better consult researchers on such topics, as this is beyond anything SO is intended to provide.

If you have a specific (set of) queries that need to be improved, providing explicit questions on those might allow the community to help identify potential optimizations. Trying to optimize without a good base of measurements leads nowhere, as what yields a perfect result in one case might kill performance in another.

rpy
  • Thanks @rpy! can you illustrate with SQL code examples? (or citing contexts by my examples) – Peter Krauss Mar 25 '16 at 15:47
  • I'm really not convinced that showing examples will be helpful for an unknown scenario. E.g. having an index on `column` will speed up a `select max(column)...` query due to an index (only) scan. (The "only" part applies only if there are more columns in the table.) Changing the query slightly to include a condition, e.g. `Select max(column) from table where othercolumn=SOMEVALUE`, might cause the index to be useless. Then having 2 separate indices on the two columns or having a combined index on `othercolumn,column` might be needed. – rpy Mar 26 '16 at 15:51
  • The whole situation will get more complex as soon as joins are introduced into the query. So do not expect a general rule along the lines of _whenever you want good performance with SOMEAGGREGATE(), then do the following..._. Look at your query, check query plans, maybe determine distributions of correlated values (those are not available from pg stats) and then start optimizing. – rpy Mar 26 '16 at 15:52
  • Yes, I agree about the complexity and unpredictability... But I think a lot of performance is lost, and complexity introduced, by what the syntax falsely demands, not by "real demand". In the "query pattern" that I showed there is no necessity for many variables in the `GROUP BY` clause, and no necessity for `max()`; so, of course, the solution is a built-in `first()` function that does nothing... Today, a good workaround was shown by Erwin, the [first_last_agg extension](http://pgxn.org/dist/first_last_agg/), but I have [not tested it](https://github.com/wulczer/first_last_agg/issues/2) yet, so I cannot say anything. – Peter Krauss Mar 26 '16 at 19:43
  • Replying without code, in subjective terms, is difficult for me... Let's try. When you use JSON and other "informal stuff", you perceive that the real world needs the programmer's intelligence as "predictor" (not the parser or the runtime SQL engine)... But the syntax (the language) must accept the programmer's belief (!), the belief that values really will repeat -- so only a sample is needed, only `first()`, and the variable can be removed from the GROUP BY clause. The language **must offer `first()`** as the simplest tool **for programmers to express predictions**... Well, PostgreSQL community, decide ;-) – Peter Krauss Mar 26 '16 at 19:56
  • We seem to be entering the field of semantic philosophy. In an ideal language, it would be sufficient to just state what you want to achieve informally. With a given semantic framing and the confines of a given implementation, what you may express efficiently is a subset of the full set of statements. – rpy Mar 27 '16 at 00:08
  • Let's stick with `first()` as an example. If you really want to express _give me the first value of a sequence of values_, you also have to specify the sequence of values and the ordering that yields what "first" should mean. However, if you actually want to express _give me any existing value from a set of values fulfilling a set of criteria_, that is the wrong statement. As there likely is no `SAMPLE()` aggregator in any known db implementation as of today, we transform the question into _give me the first of..._. The system then tries to answer this question, not the original one. – rpy Mar 27 '16 at 00:15
  • And answering the _wrong_ question also, unsurprisingly, will fail to optimize as a human could, knowing the "correct" question. In reality we regularly need to compensate for _suboptimal_ system functionality. The difficult and even dangerous part is that in most cases we have no exact information about what is already covered by optimizations and what is not. (In the first case we will never outperform the system!) – rpy Mar 27 '16 at 00:19
  • Thanks @rpy for all the discussion. About `sample()`, you captured the essence (the semantics of the question) and expressed it in good English... And you went further, asserting that it exists in no "known db implementation as of today" (!)... This encouraged me to post [the question in more generic terms on another, more general forum](http://dba.stackexchange.com/questions/133520/aggregate-sampler-function-for-programmer-express-predictions)... Can you help me there to explain it better? PS: about your last comment, it seems like an ["XY problem"](http://meta.stackexchange.com/a/250659/205446) discussion, but we skirted it ;-) – Peter Krauss Mar 28 '16 at 09:24
  • Could be a variant of this. Usually this is a specification-vs-implementation abstraction-level problem: a solution is specified in the implementation domain, while being stated as belonging to the specification domain. It is then already trying to overcome implementation restrictions that are not actually part of the problem, which causes "better" solutions to become non-obvious. – rpy Mar 28 '16 at 15:12