Get values appearing at least N times in a table quickly

Question

I have a Postgres 10.10 database with a table of more than 6 million rows and the following definition:

create table users (
    id              bigserial primary key,
    user_id         text      unique,
    username        text,
    first_name      text,
    last_name       text,
    language_code   text,
    gender          text,
    first_seen      timestamp with time zone,
    last_seen       timestamp with time zone,
    search_language text,
    age             text
);

create index users_language_code_idx on users (language_code);
create index users_last_seen_idx     on users (last_seen);
create index users_first_seen_idx1   on users (first_seen);
create index users_age_idx           on users (age);
create index users_last_seen_age_idx on users (last_seen, age);

And I have a query to fetch popular language codes with more than 100 users:

SELECT language_code FROM users
GROUP BY language_code
HAVING count(*) > 100;

At some point this query started to take a huge amount of time to finish (~10 minutes). Btree index on language_code didn't help. What else can I do to improve the performance?

Here's explain analyze output:

https://explain.depesz.com/s/j2ga

Finalize GroupAggregate  (cost=7539479.67..7539480.34 rows=27 width=3) (actual time=620744.389..620744.458 rows=24 loops=1)
  Group Key: language_code
  Filter: (count(*) > 100)
  Rows Removed by Filter: 60
  ->  Sort  (cost=7539479.67..7539479.80 rows=54 width=11) (actual time=620744.359..620744.372 rows=84 loops=1)
        Sort Key: language_code
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=7539472.44..7539478.11 rows=54 width=11) (actual time=620744.038..620744.727 rows=84 loops=1)
              Workers Planned: 2
              Workers Launched: 0
              ->  Partial HashAggregate  (cost=7538472.44..7538472.71 rows=27 width=11) (actual time=620743.596..620743.633 rows=84 loops=1)
                    Group Key: language_code
                    ->  Parallel Seq Scan on users  (cost=0.00..7525174.96 rows=2659496 width=3) (actual time=0.377..616632.155 rows=6334894 loops=1)
Planning time: 0.194 ms
Execution time: 620745.276 ms

Do you have a table `language_code` with a distinct set of language codes? (Typically you should have that.) Do you have an index on `users(language_code)`? Consider instructions for PostgreSQL performance questions: https://stackoverflow.com/tags/postgresql-performance/info — Erwin Brandstetter, Feb 17 '21 at 12:05
@ErwinBrandstetter Hello, I do have an index on `users(language_code)` and I don't have a table with distinct `language_codes` — qwerty qwerty, Feb 17 '21 at 12:08
Oh, and please provide your table structure as *text*, not as image. Ideally, a valid `CREATE TABLE` script. Never images for text. — Erwin Brandstetter, Feb 17 '21 at 12:10
@ErwinBrandstetter Sorry, added table structure in text + create table script — qwerty qwerty, Feb 17 '21 at 12:14

Erwin Brandstetter · Accepted Answer · 2021-02-17T13:24:25.917

You can make good use of the index on (language_code) with an emulated index skip scan:

WITH RECURSIVE cte AS (
   SELECT min(language_code) AS language_code
   FROM   users
   
   UNION ALL
   SELECT (SELECT language_code
           FROM   users
           WHERE  language_code > c.language_code
           ORDER  BY language_code
           LIMIT  1)
   FROM   cte c
   WHERE  c.language_code IS NOT NULL
   )
SELECT language_code
FROM   cte c
JOIN   LATERAL (
   SELECT count(*) AS ct
   FROM  (
      SELECT -- can stay empty
      FROM   users
      WHERE  language_code = c.language_code 
      LIMIT  101
      ) sub
   ) u ON ct > 100  -- "more than 100"
WHERE  language_code IS NOT NULL;

db<>fiddle here

Given your numbers (6M rows, but only a hand full of distinct language codes), this should perform faster by orders of magnitude.

The first part - the recursive CTE (rCTE) named cte - produces the set of distinct language_code in the table (except NULL). A table with distinct language codes could replace that part to be faster, yet. (It might be a good idea to maintain such a table and enforce referential integrity with a FK constraint to it ...)

The second part only looks at a maximum of 101 rows (your threshold) per language code. This way we avoid the expensive sequential scan over the whole big table.

If your table is "vacuumed" enough, you should see index-only scans exclusively.

Upgrading to the current version Postgres 13 should help some more due to the newly introduced index deduplication that should make said index substantially smaller (as it's highly duplicative).

Sadly, automatic index skip scans did not make it into version 13. Maybe Postgres 14. But above emulation should be almost as good.

Further reading (with detailed explanation for above query technique):

Wow, it works like magic, huge thanks for the answer for the formatting my questoion! — qwerty qwerty, Feb 17 '21 at 12:58

Get values appearing at least N times in a table quickly

1 Answers1