
I'm having a problem with a PostgreSQL table. The table has 14 columns, two of which are indexed as follows:

       Column        | Type | Modifiers 
---------------------+------+-----------
 u_id                | text | not null
 p_date              | date | not null
...

Indexes:
    "unq_uid_pdate" UNIQUE CONSTRAINT, btree (u_id, p_date)
    "pdate_idx" btree (p_date)

The table holds around 5 billion rows, with around 10 million distinct values of u_id, and each u_id has around 500 distinct p_date values on average.

When I export by u_id, or without any filter, performance is good, reaching speeds of over 400K rows per second without significant drops over time:

psql "..." -c "\\copy (SELECT u_id, p_date FROM l_nights) to stdout with csv header" | pv --line-mode --rate > /dev/null

[ 468k/s]
[ 485k/s]
[ 497k/s]
[ 435k/s]

However, when I try to export the approximately 10 million rows of a specific p_date, performance is very poor, starting at a few thousand rows per second and stalling within a few minutes. Adding a LIMIT, the behaviour is the same:

psql "..." -c "\\copy (SELECT u_id, p_date FROM l_nights where p_date = '2020-06-01' limit 50000) to stdout with csv header" | pv --line-mode --rate > /dev/null

[1.09k/s]
[1.28k/s]
[1.48k/s]
[3.01k/s]
[ 954 /s]
[1.40k/s]
[ 918 /s]
[ 849 /s]
[ 383 /s]
[ 364 /s]
[ 358 /s]
[0.00 /s]
[ 170 /s]
[0.00 /s]

The query is using the expected index, based on the EXPLAIN output:

explain analyse select u_id, p_date from l_nights where p_date = '2020-06-01' limit 50000;

QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..158646.09 rows=50000 width=15) (actual time=9.672..207430.728 rows=50000 loops=1)
   ->  Index Scan using pdate_idx on l_nights  (cost=0.72..233913736.58 rows=73722207 width=15) (actual time=9.671..207396.204 rows=50000 loops=1)
         Index Cond: (p_date = '2020-06-01'::date)
 Planning Time: 0.182 ms
 Execution Time: 207451.360 ms

Based on this performance, it seems to me as if the query were doing a full scan, even though the column is indexed. Other export strategies, such as reading through a PostgreSQL cursor in Node.js, behave the same.
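
For reference, the Node.js strategy is roughly equivalent to a server-side cursor in plain SQL; a minimal sketch (the cursor name and batch size are arbitrary), which stalls in the same way:

BEGIN;

-- Declare a server-side cursor over the slow query; this is roughly what
-- the Node.js pg-cursor package does under the hood.
DECLARE nights_cur CURSOR FOR
    SELECT u_id, p_date FROM l_nights WHERE p_date = '2020-06-01';

-- Fetch in batches; throughput degrades just like the \copy export.
FETCH FORWARD 10000 FROM nights_cur;
-- ... repeat FETCH until it returns no rows ...

CLOSE nights_cur;
COMMIT;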

What am I doing wrong?

  • An index on `(p_date, u_id)` might be better (sketched after these comments). You can try that. But I don't know whether retrieval is really the problem here; the bottleneck might be writing to the export file. – sticky bit Sep 02 '20 at 10:56
  • The LIMITed query is exporting 50,000 rows per second. You need to capture the plan for the query which exhibits the problem, not a related query which doesn't; 'auto_explain' can help with that (see the snippet after these comments). – jjanes Sep 02 '20 at 13:31
  • With LIMIT, the behaviour is the same, but 1000 rows is not enough to perceive it. I tried with a limit of 50K rows and got very poor performance as well. I updated the question to make this clear. Thanks. – nsblenin Sep 02 '20 at 16:04
  • I agree with sticky bit that `(p_date, u_id)` would be better (I'll explain why I think so, if you want), but could you provide the data type of `p_date`? – hsibboni Sep 05 '20 at 23:57
  • @hsibboni I added the data types of both columns to the description. Why do you think a multicolumn index would be better? If it's because the fields can be read directly from the index, I also tried selecting only the p_date field, and the speed is the same. Thanks – nsblenin Sep 07 '20 at 09:30
  • Yes, I was thinking of a [covering index](https://wiki.postgresql.org/wiki/Index-only_scans#Covering_indexes) (index-only scans were added in 9.2; [also described in the documentation](https://www.postgresql.org/docs/11/indexes-index-only-scans.html)), which makes an index-only scan a possible solution. You can use the `INCLUDE` clause (PostgreSQL 11+) when creating such an index; see the sketch after these comments. – hsibboni Sep 07 '20 at 11:16
  • But if there are too many rows, [PostgreSQL will still do a bitmap index scan or a sequential scan](https://stackoverflow.com/a/34584053/7473172); see also [this](https://stackoverflow.com/a/24760433/7473172). The same can happen if [the statistics are too old](https://stackoverflow.com/a/8229000/7473172), or if the pages are not all-visible (covered in the docs too; search for "visibility map"). – hsibboni Sep 07 '20 at 11:16
  • If nothing works, and considering you have a specific use case, you can try partitioning the table by a range on the date (sketched after these comments). This is just a list of links; if you want, I can try writing something closer to an answer. – hsibboni Sep 07 '20 at 11:21
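
A sketch of the index suggestions from the comments above (the index names are illustrative; CONCURRENTLY avoids blocking writes on a table this size, and the INCLUDE variant requires PostgreSQL 11+):

-- Composite index: rows for one p_date are adjacent in the index,
-- and u_id is available without visiting the heap.
CREATE INDEX CONCURRENTLY pdate_uid_idx ON l_nights (p_date, u_id);

-- Covering-index variant: u_id is stored only in the leaf pages,
-- which also enables index-only scans for this query.
CREATE INDEX CONCURRENTLY pdate_incl_uid_idx ON l_nights (p_date) INCLUDE (u_id);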
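
To capture the plan of the export query that actually exhibits the problem, as jjanes suggests, auto_explain can log it from the exporting session; a minimal sketch, assuming sufficient privileges to LOAD the module:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;  -- log the plan of every statement
SET auto_explain.log_analyze = true;    -- include actual times and row counts
-- Run the \copy in this same session; the executed plan appears in the server log.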
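
And a sketch of the range-partitioning idea from the last comment (PostgreSQL 10+). The table and partition names are illustrative, and the remaining twelve columns are elided:

CREATE TABLE l_nights_part (
    u_id   text NOT NULL,
    p_date date NOT NULL
    -- ... remaining columns ...
) PARTITION BY RANGE (p_date);

-- One partition per month: an export for a single p_date then reads one
-- much smaller partition instead of scattered pages of a 5-billion-row heap.
CREATE TABLE l_nights_2020_06 PARTITION OF l_nights_part
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');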

0 Answers