One of the claims Neo4j makes in their marketing is that relational databases aren't good at doing multi-level self-join queries:
I found the code repository corresponding to the book that claim is taken from, and translated it to Postgres:
CREATE TABLE t_user (
id bigserial PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE t_user_friend (
id bigserial PRIMARY KEY,
user_1 bigint NOT NULL REFERENCES t_user,
user_2 bigint NOT NULL REFERENCES t_user
);
CREATE INDEX idx_user_friend_user_1 ON t_user_friend (user_1);
CREATE INDEX idx_user_friend_user_2 ON t_user_friend (user_2);
/* Create 1M users, each getting a random 10-character name */
INSERT INTO t_user (id, name)
SELECT x.id, substr(md5(random()::text), 0, 10)
FROM generate_series(1,1000000) AS x(id);
/* For each user, create 50 random friendships for a total of 50M friendship records */
INSERT INTO t_user_friend (user_1, user_2)
SELECT g1.x AS user_1, (1 + (random() * 999999)) :: int AS user_2
FROM generate_series(1, 1000000) as g1(x), generate_series(1, 50) as g2(y);
And these are the queries at various depths Neo4j is comparing against:
/* Depth 2 */
SELECT
COUNT(DISTINCT f2.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
WHERE
f1.user_1 = 1;
/* Depth 3 */
SELECT
COUNT(DISTINCT f3.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
INNER JOIN
t_user_friend f3
ON f2.user_2 = f3.user_1
WHERE
f1.user_1 = 1;
/* Depth 4 */
SELECT
COUNT(DISTINCT f4.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
INNER JOIN
t_user_friend f3
ON f2.user_2 = f3.user_1
INNER JOIN
t_user_friend f4
ON f3.user_2 = f4.user_1
WHERE
f1.user_1 = 1;
/* Depth 5 */
SELECT
COUNT(DISTINCT f5.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
INNER JOIN
t_user_friend f3
ON f2.user_2 = f3.user_1
INNER JOIN
t_user_friend f4
ON f3.user_2 = f4.user_1
INNER JOIN
t_user_friend f5
ON f4.user_2 = f5.user_1
WHERE
f1.user_1 = 1;
I was roughly able to reproduce the book's claimed results, getting these sorts of execution times against the 1M users, 50M friendships:
| Depth | Count(*) | Time (s) |
|-------|----------|----------|
| 2 | 2497 | 0.067 |
| 3 | 117301 | 0.118 |
| 4 | 997246 | 8.409 |
| 5 | 999999 | 214.56 |
(Here's an EXPLAIN ANALYZE of a depth 5 query)
My question is, is there a way to improve the performance of these queries to meet or exceed Neo4j's execution time of ~2s at depth level 5?
I tried with this recursive CTE:
WITH RECURSIVE chain(user_2, depth) AS (
SELECT t.user_2, 1 as depth
FROM t_user_friend t
WHERE t.user_1 = 1
UNION
SELECT t.user_2, c.depth + 1 as depth
FROM t_user_friend t, chain c
WHERE t.user_1 = c.user_2
AND depth < 4
)
SELECT COUNT(*)
FROM (SELECT DISTINCT user_2 FROM chain) AS temp;
However it's still pretty slow, with a depth 4 taking 5s and a depth 5 taking 48s (EXPLAIN ANALYZE)