9

I have 3.5 million rows in table acs_objects and I need to retrieve the distinct values of the creation_date column in year-only format.

My first attempt: 180~200 sec (15 rows fetched)

SELECT DISTINCT to_char(creation_date,'YYYY') FROM acs_objects

My second attempt: 35~40 sec (15 rows fetched)

SELECT DISTINCT to_char(creation_date,'YYYY')
FROM (SELECT DISTINCT creation_date FROM acs_objects) AS distinct_date

Is there any way to make it faster? (I need to use this in an ADP website.)

Onel Sarmiento
  • Have you already tried [`extract`](http://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-TABLE)? – Andriy M May 15 '15 at 07:24
  • did you index the column? – Roger May 15 '15 at 07:34
  • @Rogier I don't know how to use index yet. – Onel Sarmiento May 15 '15 at 07:51
  • @Leonel well you don't 'use' it. But I can imagine if the column is indexed, the sort will be faster, which could boost performance. On my MySQL table (325,000 rows), the performance gain is +/- 20% once indexed. – Roger May 15 '15 at 07:57
  • An index on `extract(year FROM creation_date)` would be nice indeed for this specific query and make it lightning fast, but it may negatively impact other uses of the table. – Patrick May 15 '15 at 08:13
  • What's the data type of your column `creation_date`? `timestamp`, `timestamptz`, `date`? Can it be `NULL`? Can there be future dates? What's the oldest possible date? What's your version of Postgres? What is "ADP Website" supposed to mean? You have good answers already, but there are faster queries, yet. – Erwin Brandstetter May 18 '15 at 01:11
  • The currently accepted answer is more expensive than presented alternatives by a factor of 1000. I think you should reassign, never mind the *wording* of the title. The solution to the presented problem is what matters. – Erwin Brandstetter May 18 '15 at 13:22
  • @ErwinBrandstetter The data type is `timestamptz`, it can't be `NULL`, and the oldest possible date is `01-01-2010`. Postgres `Version 3.8`, and ADP stands for "AOLserver Dynamic Pages". – Onel Sarmiento May 19 '15 at 08:12
  • There is no Postgres version 3.8: http://www.postgresql.org/support/versioning/ What do you get from `SELECT version();`? – Erwin Brandstetter May 19 '15 at 12:35
  • @ErwinBrandstetter PostgreSQL 8.1.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.1 20060711 (Red Hat 4.1.1-8) – Onel Sarmiento May 20 '15 at 02:31
  • Postgres 8.1 is long dead and forgotten: http://www.postgresql.org/support/versioning/ I strongly suggest you upgrade to a current version. Also, this kind on information *must* be in the question. – Erwin Brandstetter May 20 '15 at 12:55

4 Answers

16

I think you shouldn't select distinct values from this huge table. Instead, try to generate a short sequence of years (say from 1900 to 2100) and select from this sequence only the years that exist in the acs_objects table. The result set will be the same, but I think it will be faster: the EXISTS subquery should run fast on an indexed creation_date column.

SELECT y
FROM  (SELECT generate_series(1900, 2100) AS y) AS t
WHERE EXISTS (SELECT 1 FROM acs_objects
              WHERE creation_date >= DATE (''||t.y||'-01-01')
              AND   creation_date <  DATE (''||t.y + 1||'-01-01'))

SQLFiddle demo
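
If the 'YYYY' text format from the question is needed, the integer y can simply be cast to text in the outer query. For example, a minimal variation of the query above, under the same assumptions:

SELECT t.y::text AS year
FROM  (SELECT generate_series(1900, 2100) AS y) AS t
WHERE EXISTS (SELECT 1 FROM acs_objects
              WHERE creation_date >= DATE (''||t.y||'-01-01')
              AND   creation_date <  DATE (''||t.y + 1||'-01-01'))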

valex
  • Wow, this is amazing: 12~15 ms. But the question is about "Distinct and Format", so I can't accept this as an answer, but I can give you +1. Thank you! – Onel Sarmiento May 15 '15 at 08:36
  • @Leonel: When you ask a question you should be glad to get answers which think outside of the box and still return the correct answer (of course you can format `y`). There are always multiple ways to get a result in SQL. – dnoeth May 15 '15 at 09:11
  • Great answer, wonderful example of thinking outside of the box! – Frank Heikens May 15 '15 at 09:27
  • Exactly this!!!! In SQL you should always think about what you want to know. You want to know "Which years are present in this table?", and you know the range of years which are possible. So this is the logical way: check for each year if there is at least one match. I think it would be best if you could create the range dynamically from MIN(year) to MAX(year) in the table; it could even be faster, and some poor programmer doesn't have to change it in 85 years. – Falco May 15 '15 at 12:21
  • `WHERE creation_date >= DATE (''||t.y||'-01-01') AND creation_date < DATE (''||t.y + 1||'-01-01')` would be my choice in cases where I'm not sure whether the column is a date or datetime or, if I know it's a date, whether it can't ever become a datetime. – Andriy M May 16 '15 at 21:57
  • @AndriyM You're right. I changed this condition in the answer. Thanks. – valex May 18 '15 at 06:16
7

In your second attempt you get the distinct dates from the sub-query, then convert them all to a string representation, and then select the distinct ones from those strings. That is rather inefficient. It is better to first extract the distinct years from creation_date in a sub-query and simply cast those to text in the main query:

SELECT year::text
FROM (
  SELECT DISTINCT extract(year FROM creation_date) AS year FROM acs_objects
) AS distinct_years;

If you create an INDEX on the table, the query should run much faster still:

CREATE INDEX really_fast ON acs_objects((extract(year FROM creation_date)));

However, this may impact other uses of your table, in particular if you have many modifying statements (INSERT, UPDATE, DELETE). And this will only work if creation_date has a data type of date or timestamp (specifically not timestamp with time zone: extracting the year from a timestamptz depends on the session time zone, so the expression is not immutable and cannot be indexed directly).
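
If creation_date does turn out to be timestamp with time zone (as the OP later states in the comments), one possible workaround is to pin the expression to a fixed time zone, which makes it immutable and therefore indexable. A sketch, assuming UTC is the intended reference zone; the index name is made up:

-- hypothetical index name; pinning a zone (here UTC) makes the expression immutable
CREATE INDEX acs_objects_creation_year_utc_idx
    ON acs_objects ((extract(year FROM (creation_date AT TIME ZONE 'UTC'))));

A query can only use this index if it repeats the exact same expression.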

The option below looked promising because it does not use a sub-query, but it is in fact much slower (see the comments below), probably because DISTINCT is then applied to the string values:

SELECT DISTINCT extract(year FROM creation_date)::text
FROM acs_objects;
Patrick
  • Do you have to use a derived table? Would `SELECT DISTINCT extract(year FROM creation_date)::text AS year FROM acs_objects;` produce a different plan? I'm asking because I don't have PostgreSQL installed and can't check myself. – Andriy M May 15 '15 at 07:29
  • I was just writing that up as an edit. Better not to have a sub-query, yes. But the running time difference is probably minimal. – Patrick May 15 '15 at 07:30
  • Good luck tweaking this one further! I'd be interested to hear what you get with my second option. – Patrick May 15 '15 at 07:34
  • @Patrick The second one give me 80~90 sec, First one is better. – Onel Sarmiento May 15 '15 at 07:44
  • That is a big difference indeed. Updated the answer. – Patrick May 15 '15 at 08:08
  • @Leonel Did you end up creating the index? If so, what is the updated running time? I'd be interested to know how it compares to valex's answer with an index on the year. Perhaps you can compare running times of all the promising answers with and without index as an update to your question to make it a really nice Q&A for future problem solvers. – Patrick May 16 '15 at 00:58
4

I'm not sure what you use it for. I would probably consider using a materialized view.

You can then refresh the view when needed and have a very fast way to retrieve the (distinct) year list, since the result is basically stored statically.
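
A minimal sketch of that idea, assuming Postgres 9.3 or later (which introduced materialized views); the view name is made up:

-- hypothetical view name; materialized views require Postgres 9.3+
CREATE MATERIALIZED VIEW acs_object_years AS
SELECT DISTINCT to_char(creation_date, 'YYYY') AS year
FROM   acs_objects;

-- rerun the expensive query only when the data has changed:
REFRESH MATERIALIZED VIEW acs_object_years;

-- the website then reads the tiny cached result:
SELECT year FROM acs_object_years ORDER BY year;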

Have a look here:

Roger
4

Is there any way to make it faster?

Oh yes, much faster. (Updated 2021.)

Basic assessment

If you need this often and fast, and writes to the table are either few or predictable (like: new rows always have current time), a materialized view would be fastest, like @Roger suggested. But you still need a query to implement it. And the queries I am going to suggest are so fast that you might skip the MV ...

In related cases there is often a look-up table with candidate values allowing for a much faster query:
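
A hedged sketch of that idea (the year_lookup table and its contents are hypothetical, and make_date() needs Postgres 9.4 or later):

-- hypothetical look-up table with one row per candidate year, maintained separately:
-- CREATE TABLE year_lookup (y int PRIMARY KEY);

SELECT l.y
FROM   year_lookup l
WHERE  EXISTS (
   SELECT FROM acs_objects
   WHERE  creation_date >= make_date(l.y, 1, 1)
   AND    creation_date <  make_date(l.y + 1, 1, 1)
   );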

Assumptions for this case:

  • Postgres 9.4 or later.
  • creation_date is data type timestamp (works for date or timestamptz too).
  • Actual range of timestamps is unknown.
  • There is a btree index on acs_objects(creation_date).
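
If that index does not exist yet, it can be created like this (the index name is arbitrary):

CREATE INDEX acs_objects_creation_date_idx ON acs_objects (creation_date);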

Emulate loose index scan with rCTE

If you neither have a look-up table nor a derived table with candidate values, there is still a very fast alternative. Basically, you need to emulate an "index skip scan", a.k.a. "loose index scan". This query works in any case:

WITH RECURSIVE cte AS (
   SELECT date_trunc('year', max(creation_date)) AS y
   FROM   acs_objects

   UNION ALL
   SELECT (SELECT date_trunc('year', max(creation_date))
           FROM   acs_objects
           WHERE  creation_date < cte.y)
   FROM   cte
   WHERE  cte.y IS NOT NULL
   )
SELECT to_char(y, 'YYYY') AS year
FROM   cte
WHERE  cte.y IS NOT NULL;

Probably fastest: work top-down, truncate each timestamp found to the start of its year, then find the latest of the earlier rows; repeat.

Details for this technique:

Based on generate_series()

valex's idea can be implemented more efficiently with generate_series() producing timestamp values based on the actual range of existing years:

SELECT to_char(y, 'YYYY') AS year
FROM  (
   SELECT generate_series(date_trunc('year', min(creation_date))
                        , max(creation_date)
                        , interval  '1 year')
   FROM   acs_objects
   ) t(y)
WHERE  EXISTS (
   SELECT FROM acs_objects 
   WHERE creation_date >= y
   AND   creation_date <  y + interval '1 year'
   );

db<>fiddle here demonstrating both.
Old sqlfiddle

If there are few gaps in the range of years, this may be faster yet. But either should only take a few milliseconds or less, regardless of the table size.

Related:

Erwin Brandstetter
  • +1. I think using @Falco's comment on my answer (use `MIN(year)` and `MAX(year)` from the table instead of 1900 and 2020) and your `generate_series` will run even faster. But I guess it already depends on the distribution of years in the table. – valex May 18 '15 at 06:40
  • @valex: Yes, a range from min to max is cleaner. I added a link. There can still be major gaps in the range of values, which would increase the cost. I would use the rCTE which works in any case - slightly slower than with a perfect look-up table, but faster than with a derived table providing more than a few irrelevant values. – Erwin Brandstetter May 18 '15 at 13:18