I need to filter a large database 2 billion entries using python and sql

Question

The data is in bytea format, how do I query the postgresql database, its is indexed in the bytea column i need to query by the first 4 bytes. I have tried

SELECT * FROM table WHERE addr LIKE '%8ac5c320____'

but it takes too long to find. Any suggestions? if I query the whole string then it works fast, but there are about 2 billion entries and i cant use wild cards...

First thing to check if a DB query "takes too long" is if there is a matching index. Try an `EXPLAIN` on your query. — Klaus D., Sep 20 '17 at 20:51
Leading wildcards are not deterministic which will greatly impact index efficiency. — Kyle, Sep 20 '17 at 20:56
@Kyle There are [ways around that](https://stackoverflow.com/a/13452528/479863) though. — mu is too short, Sep 20 '17 at 21:17

score 1 · Answer 1 · answered Sep 20 '17 at 21:53

1

To get matches based on the first four bytes, I'd recommend the following query:

SELECT * FROM table WHERE substring(addr from 0 for 5) = '\x8ac5c320'::bytea;

The documentation for substring can be found on the bytea functions page, though that's admittedly minimal.

The query as written will likely perform a sequential scan across the entire table. To remedy that, create the following index:

CREATE INDEX ON table (substring(addr from 0 for 5));

That creates an index specifically designed for the query you need to run frequently. It's a functional index -- it's indexing a function result, rather than a column.

That should get you the performance that you want.

All that said, though, your example query does not actually query for the first four bytes. If the query is more correct than your description of the query, then this approach won't work.

answered Sep 20 '17 at 21:53

jmelesky

3,710
1
24
24

explain SELECT * FROM table WHERE substring(addr from 0 for 5) = '\x8ac5c320'::bytea; QUERY PLAN ------------------------------------------------------------------------------------------------- Bitmap Heap Scan on table (cost=160844.32..17035395.99 rows=8592999 width=71) Recheck Cond: ("substring"(addr, 0, 5) = '\x8ac5c320'::bytea) -> Bitmap Index Scan on table_substring_idx (cost=0.00..158696.07 rows=8592999 width=0) Index Cond: ("substring"(addr, 0, 5) = '\x8ac5c320'::bytea) – Roger Sep 22 '17 at 07:51
This is taking too long to execute is there a reason for that? – Roger Sep 22 '17 at 07:51
Taking about 30 secs to 1 min to find, is there a faster way? – Roger Sep 22 '17 at 10:23
It's impossible to say for certain with the small amount of information provided. It may be that it's just taking significantly longer to pull 8.5million rows than to pull a single row (which makes sense). You may want to read https://wiki.postgresql.org/wiki/SlowQueryQuestions and post a new question. – jmelesky Sep 22 '17 at 10:55

I need to filter a large database 2 billion entries using python and sql

1 Answers1