EDIT In my original question I noticed a difference between searching for neighbors using a `JOIN` and using a `WHERE .. IN` clause, which @LukaszSzozda rightfully pointed out is a semi-join. It turns out my node list had duplicates, which explains why the `JOIN` took longer to run. Thanks, @LukaszSzozda. The more important aspect of my question remains, though, and is presented below.
UPDATE I added the relevant configuration options to the bottom and updated the statistics using `ANALYZE` (thanks to @joop). Also, I tested with three different indices (B-Tree, hash, BRIN). Finally, I noticed that different queries returned different numbers of rows into `tmp_nodes`, possibly because of different ordering, so I fixed it to a constant set of 8,000 rather-random nodes (a sketch of how such a set could be pinned follows).
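A minimal sketch of pinning such a reproducible sample, assuming a `nodes` table holding all node ids (the table name and seed here are assumptions, not part of my setup):

SELECT setseed(0.42);              -- fix the PRNG so the sample is repeatable within this session
CREATE TEMP TABLE tmp_nodes AS
    SELECT nid
    FROM nodes                     -- assumed table of all node ids
    ORDER BY random()              -- full sort; acceptable as a one-off for building the test set
    LIMIT 8000;
ANALYZE tmp_nodes;                 -- refresh planner statistics for the new table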
In PostgreSQL, my query to search for the neighbors of 8,000 nodes among ~200*10^6 nodes (within ~1.3*10^9 edges) is slow (~30 seconds using a hash index; see index benchmarking below).
Given the setup I describe below, are there further optimizations for my server software, database, table or query to make the neighbor search faster? I am particularly surprised at this speed considering how well PostgreSQL did on the ArangoDB NoSQL benchmark.
More specifically:
- I am aware of AgensGraph, but do not wish to move to a graph-database solution yet, specifically since I cannot tell how well AgensGraph keeps up to date with PostgreSQL. Can someone explain the performance benefits with regard to how the query actually happens in AgensGraph vs PostgreSQL, so that I can decide whether to migrate?
- Are there any configuration tweaks, whether in the server or the OS, that affect my query according to the plan and cause it to run longer than needed?
Setup
I have a large graph database (~10^9 edges, ~200*10^6 nodes) in PostgreSQL (PostgreSQL 10.1, which I had to pull from the `zesty` PPA), stored on the cloud (DigitalOcean, 6-core, 16GB RAM machine, Ubuntu 17.10, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz) and set up with the parameters suggested by PGTune (see bottom). I am querying on-server.
I have created forward- and backward-edges tables (see this question):
CREATE TABLE edges_fwd (src BIGINT, dest BIGINT, PRIMARY KEY (src, dest));
CREATE TABLE edges_back (src BIGINT, dest BIGINT, PRIMARY KEY (dest, src));
and clustered both by the respective keys (just in case):
CLUSTER edges_fwd USING edges_fwd_pkey;
CLUSTER edges_back USING edges_back_pkey;
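The hash index that appears in the plans below was created along these lines (the exact statement is reconstructed from the index name in the plans, so treat it as an assumption):

CREATE INDEX ix_edges_fwd_src_hash ON edges_fwd USING hash (src);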
I turned off `enable_seqscan` for the purpose of testing my queries (see the side note below).
I would like to load all the out-edges for 8,000 nodes (which 8,000 nodes these are can change depending on the user query), whose identifiers are listed in a table `tmp_nodes` (with a single column, `nid`). I initially wrote this version of the query (patting myself on the back for already following the lines of the graph talk from PGCon11):
SELECT e.*
FROM tmp_nodes
JOIN edges_fwd AS e
ON e.src = tmp_nodes.nid;
I also tried:
SELECT * FROM edges_fwd AS e
WHERE e.src IN (SELECT nid FROM tmp_nodes);
They are both slow, taking about 30 seconds to run at best (using hash indices). `EXPLAIN ANALYZE` outputs are given below.
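As a side point, `EXPLAIN`'s `BUFFERS` option would also report how many pages were actually touched, which bears directly on the page-read arithmetic below; for example:

EXPLAIN (ANALYZE, BUFFERS)
SELECT e.*
FROM tmp_nodes
JOIN edges_fwd AS e ON e.src = tmp_nodes.nid;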
I expected things to run much faster in general. For looking up 8,000 keys in a clustered table (yes, I know it's not really a clustered index), since the server knows the rows are ordered, I should expect fewer page reads than the total number of rows returned. So while 243,708 rows are fetched, which isn't a small number, they are associated with only 8,000 distinct keys, and the number of reads should not be much larger than that: it's an average of 30 rows per key, or about 1,400 bytes per read (the table size is 56GB and has 1.3B rows, so it's about 46 bytes per row; which, by the way, is quite a bloat for 16 bytes of data). This is far below the page size (4K) of the system. I didn't think reading 8,000 pages, even with random access, should take this long.
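For reference, a quick way to sanity-check the bytes-per-row and rows-per-page figures above (a sketch; it assumes the default 8kB PostgreSQL page size and that `pg_class.reltuples` is up to date after `ANALYZE`):

-- approximate per-row footprint and rows per 8kB page for edges_fwd
SELECT pg_size_pretty(pg_relation_size('edges_fwd'))             AS table_size,
       (pg_relation_size('edges_fwd') / reltuples)::int          AS approx_bytes_per_row,
       (reltuples / (pg_relation_size('edges_fwd') / 8192))::int AS approx_rows_per_page
FROM pg_class
WHERE relname = 'edges_fwd';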
This brings me back to my questions (above).
Forcing index usage
I took advice from answers to another question and, at least for testing (though, since my database is read-only, I might be tempted to use it in production), set `enable_seqscan` to `off` in order to force index usage.
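Concretely (a session-level setting; it works by inflating the planner's cost estimate for sequential scans, which is why the plans below show costs starting at 10000000000):

SET enable_seqscan = off;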
I ran each query 5 times; the times varied by a few seconds here and there.
`EXPLAIN ANALYZE` outputs
Taking care to flush the OS disk cache and restart the server to reflect correct random-seek timings, I used `EXPLAIN ANALYZE` on both queries. I used two types of indexes, B-Tree and hash. I also tried BRIN with different values for the `pages_per_range` option (2, 8, 32 and 128; a sketch of one such index definition follows), but they were all slower (by orders of magnitude) than those mentioned above. I am giving the results below for reference.
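For concreteness, one of the BRIN variants tried would have looked something like this (the index name is hypothetical; only the `pages_per_range` values come from my tests):

CREATE INDEX ix_edges_fwd_src_brin ON edges_fwd USING brin (src) WITH (pages_per_range = 32);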
B-Tree index, `JOIN` query:
Nested Loop (cost=10000000000.58..10025160709.50 rows=15783833 width=16) (actual time=4.546..39152.408 rows=243708 loops=1)
-> Seq Scan on tmp_nodes (cost=10000000000.00..10000000116.00 rows=8000 width=8) (actual time=0.712..15.721 rows=8000 loops=1)
-> Index Only Scan using edges_fwd_pkey on edges_fwd e (cost=0.58..3125.34 rows=1973 width=16) (actual time=4.565..4.879 rows=30 loops=8000)
Index Cond: (src = tmp_nodes.nid)
Heap Fetches: 243708
Planning time: 20.962 ms
Execution time: 39175.454 ms
B-Tree index, `WHERE .. IN` query (semi-join):
Nested Loop (cost=10000000136.58..10025160809.50 rows=15783833 width=16) (actual time=9.578..42605.783 rows=243708 loops=1)
-> HashAggregate (cost=10000000136.00..10000000216.00 rows=8000 width=8) (actual time=5.903..35.750 rows=8000 loops=1)
Group Key: tmp_nodes.nid
-> Seq Scan on tmp_nodes (cost=10000000000.00..10000000116.00 rows=8000 width=8) (actual time=0.722..2.695 rows=8000 loops=1)
-> Index Only Scan using edges_fwd_pkey on edges_fwd e (cost=0.58..3125.34 rows=1973 width=16) (actual time=4.924..5.309 rows=30 loops=8000)
Index Cond: (src = tmp_nodes.nid)
Heap Fetches: 243708
Planning time: 19.126 ms
Execution time: 42629.084 ms
Hash index, `JOIN` query:
Nested Loop (cost=10000000051.08..10056052287.01 rows=15783833 width=16) (actual time=3.710..34131.371 rows=243708 loops=1)
-> Seq Scan on tmp_nodes (cost=10000000000.00..10000000116.00 rows=8000 width=8) (actual time=0.960..13.338 rows=8000 loops=1)
-> Bitmap Heap Scan on edges_fwd e (cost=51.08..6986.79 rows=1973 width=16) (actual time=4.086..4.250 rows=30 loops=8000)
Heap Blocks: exact=8094
-> Bitmap Index Scan on ix_edges_fwd_src_hash (cost=0.00..50.58 rows=1973 width=0) (actual time=2.563..2.563 rows=31 loops=8000)
Execution time: 34155.511 ms
Hash index, `WHERE .. IN` query (semi-join):
Nested Loop (cost=10000000187.08..10056052387.01 rows=15783833 width=16) (actual time=12.766..31834.767 rows=243708 loops=1)
-> HashAggregate (cost=10000000136.00..10000000216.00 rows=8000 width=8) (actual time=6.297..30.760 rows=8000 loops=1)
-> Seq Scan on tmp_nodes (cost=10000000000.00..10000000116.00 rows=8000 width=8) (actual time=0.883..3.108 rows=8000 loops=1)
-> Bitmap Heap Scan on edges_fwd e (cost=51.08..6986.79 rows=1973 width=16) (actual time=3.768..3.958 rows=30 loops=8000)
Heap Blocks: exact=8094
-> Bitmap Index Scan on ix_edges_fwd_src_hash (cost=0.00..50.58 rows=1973 width=0) (actual time=2.340..2.340 rows=31 loops=8000)
Execution time: 31857.692 ms
`postgresql.conf` settings
I set the following configuration options as suggested by PGTune:
max_connections = 10
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 4
effective_io_concurrency = 2
work_mem = 69905kB
min_wal_size = 4GB
max_wal_size = 8GB
max_worker_processes = 6
max_parallel_workers_per_gather = 3
max_parallel_workers = 6
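For anyone reproducing this: the same settings can also be applied without editing `postgresql.conf` directly (note that some of them, e.g. `shared_buffers`, only take effect after a server restart):

ALTER SYSTEM SET effective_io_concurrency = 2;
SELECT pg_reload_conf();   -- reload settings that do not require a restart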