I'd like to join IP routing table information to IP whois information. I'm using Amazon's RDS which means I can't use the Postgres ip4r extension, and so I am instead using int8range types to represent the IP address ranges, with gist indexes.
My tables look like this:
=> \d routing_details
Table "public.routing_details"
Column | Type | Modifiers
----------+-----------+-----------
asn | text |
netblock | text |
range | int8range |
Indexes:
"idx_routing_details_netblock" btree (netblock)
"idx_routing_details_range" gist (range)
=> \d netblock_details
Table "public.netblock_details"
Column | Type | Modifiers
------------+-----------+-----------
range | int8range |
name | text |
country | text |
source | text |
Indexes:
"idx_netblock_details_range" gist (range)
The full routing_details table contains just under 600K rows, and netblock_details contains around 8.25M rows. There are overlapping ranges in both tables, but for each range in the routing_details table I want to get the single best (smallest) match from the netblock_details table.
I came up with 2 different queries that I think will return the accurate data, one using window functions and one using DISTINCT ON:
EXPLAIN SELECT DISTINCT ON (r.netblock) *
FROM routing_details r JOIN netblock_details n ON r.range <@ n.range
ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Unique (cost=118452809778.47..118477166326.22 rows=581300 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, ((upper(n.range) - lower(n.range)))
-> Sort (cost=118452809778.47..118464988052.34 rows=4871309551 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, ((upper(n.range) - lower(n.range)))
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
-> Nested Loop (cost=0.00..115920727265.53 rows=4871309551 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, (upper(n.range) - lower(n.range))
Join Filter: (r.range <@ n.range)
-> Seq Scan on public.routing_details r (cost=0.00..11458.96 rows=592496 width=43)
Output: r.asn, r.netblock, r.range
-> Materialize (cost=0.00..277082.12 rows=8221675 width=48)
Output: n.range, n.name, n.country
-> Seq Scan on public.netblock_details n (cost=0.00..163712.75 rows=8221675 width=48)
Output: n.range, n.name, n.country
(14 rows) -> Seq Scan on netblock_details n (cost=0.00..163712.75 rows=8221675 width=48)
EXPLAIN VERBOSE SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY r.range ORDER BY UPPER(n.range) - LOWER(n.range)) AS rank
FROM routing_details r JOIN netblock_details n ON r.range <@ n.range
) a WHERE rank = 1 ORDER BY netblock;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=118620775630.16..118620836521.53 rows=24356548 width=99)
Output: a.asn, a.netblock, a.range, a.range_1, a.name, a.country, a.rank
Sort Key: a.netblock
-> Subquery Scan on a (cost=118416274956.83..118611127338.87 rows=24356548 width=99)
Output: a.asn, a.netblock, a.range, a.range_1, a.name, a.country, a.rank
Filter: (a.rank = 1)
-> WindowAgg (cost=118416274956.83..118550235969.49 rows=4871309551 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, row_number() OVER (?), ((upper(n.range) - lower(n.range))), r.range
-> Sort (cost=118416274956.83..118428453230.71 rows=4871309551 width=91)
Output: ((upper(n.range) - lower(n.range))), r.range, r.asn, r.netblock, n.range, n.name, n.country
Sort Key: r.range, ((upper(n.range) - lower(n.range)))
-> Nested Loop (cost=0.00..115884192443.90 rows=4871309551 width=91)
Output: (upper(n.range) - lower(n.range)), r.range, r.asn, r.netblock, n.range, n.name, n.country
Join Filter: (r.range <@ n.range)
-> Seq Scan on public.routing_details r (cost=0.00..11458.96 rows=592496 width=43)
Output: r.asn, r.netblock, r.range
-> Materialize (cost=0.00..277082.12 rows=8221675 width=48)
Output: n.range, n.name, n.country
-> Seq Scan on public.netblock_details n (cost=0.00..163712.75 rows=8221675 width=48)
Output: n.range, n.name, n.country
(20 rows)
The DISTINCT ON seems slightly more efficient, so I've proceeded with that one. When I run the query against the full dataset I get an out of disk space error after around a 24h wait. I've created a routing_details_small table with a subset of N rows of the full routing_details table to try and understand what's going on.
With N=1000
=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=4411888.68..4453012.20 rows=999 width=90) (actual time=124.094..133.720 rows=999 loops=1)
-> Sort (cost=4411888.68..4432450.44 rows=8224705 width=90) (actual time=124.091..128.560 rows=4172 loops=1)
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
Sort Method: external sort Disk: 608kB
-> Nested Loop (cost=0.41..1780498.29 rows=8224705 width=90) (actual time=0.080..101.518 rows=4172 loops=1)
-> Seq Scan on routing_details_small r (cost=0.00..20.00 rows=1000 width=42) (actual time=0.007..1.037 rows=1000 loops=1)
-> Index Scan using idx_netblock_details_range on netblock_details n (cost=0.41..1307.55 rows=41124 width=48) (actual time=0.063..0.089 rows=4 loops=1000)
Index Cond: (r.range <@ range)
Total runtime: 134.999 ms
(9 rows)
With N=100000
=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=654922588.98..659034941.48 rows=200 width=144) (actual time=28252.677..29487.380 rows=98992 loops=1)
-> Sort (cost=654922588.98..656978765.23 rows=822470500 width=144) (actual time=28252.673..28926.703 rows=454856 loops=1)
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
Sort Method: external merge Disk: 64488kB
-> Nested Loop (cost=0.41..119890431.75 rows=822470500 width=144) (actual time=0.079..24951.038 rows=454856 loops=1)
-> Seq Scan on routing_details_small r (cost=0.00..1935.00 rows=100000 width=96) (actual time=0.007..110.457 rows=100000 loops=1)
-> Index Scan using idx_netblock_details_range on netblock_details n (cost=0.41..725.96 rows=41124 width=48) (actual time=0.067..0.235 rows=5 loops=100000)
Index Cond: (r.range <@ range)
Total runtime: 29596.667 ms
(9 rows)
With N=250000
=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=1651822953.55..1662103834.80 rows=200 width=144) (actual time=185835.443..190143.266 rows=247655 loops=1)
-> Sort (cost=1651822953.55..1656963394.18 rows=2056176250 width=144) (actual time=185835.439..188779.279 rows=1103850 loops=1)
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
Sort Method: external merge Disk: 155288kB
-> Nested Loop (cost=0.28..300651962.46 rows=2056176250 width=144) (actual time=19.325..177403.913 rows=1103850 loops=1)
-> Seq Scan on netblock_details n (cost=0.00..163743.05 rows=8224705 width=48) (actual time=0.007..8160.346 rows=8224705 loops=1)
-> Index Scan using idx_routing_details_small_range on routing_details_small r (cost=0.28..22.16 rows=1250 width=96) (actual time=0.018..0.018 rows=0 loops=8224705)
Index Cond: (range <@ n.range)
Total runtime: 190413.912 ms
(9 rows)
Against the full table with 600k rows the query fails after around 24h with an error about running out of disk space, which is presumably caused by the external merge step. So this query is working well and very quickly with a small routing_details table, but is scaling very poorly.
Suggestions for how to improve my query, or perhaps even schema changes I could make so that this query will work efficiently on the full dataset?