1

I have a table of files paths with hashes of its content, there could be multiple files with the same hash.

create table files(
  path varchar(256) not null,
  hash varchar(100) not null
);

create index files_hash on files (hash);

Given I have array of 3 hashes 'a', 'b', 'c', how can I find efficiently which of those hashes the files table contains?

I can use select distinct hash to get hashes that exist in files:

select distinct hash
from   files
where  hash in ('a', 'b', 'c')

But would it be efficient? Like say there are hundreds of thousands of files with hash 'a', would PostgreSQL iterate over all those records? Is there a way to tell it to stop immediately as soon as it finds the first one?

Ken White
  • 123,280
  • 14
  • 225
  • 444
Alex Craft
  • 13,598
  • 11
  • 69
  • 133

2 Answers2

1

If you want all hashes from an array, I would suggest:

select distinct hash
from files
where hash = any(array['a', 'b', 'c']);

For performance on this query, you want an index on files(hash).

If you only want one returned, then this should be faster:

select hash
from files
where hash = any(array['a', 'b', 'c'])
limit 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

This should be as fast as it gets:

SELECT * 
FROM   unnest('{a,b,c}'::varchar[]) AS arr(hash)
WHERE  EXISTS (SELECT FROM files f WHERE f.hash = arr.hash);

If your table is VACUUM'ed enough, you get an index-only scan on your files_hash index with constant (excellent) performance, no matter how many hashes match. See:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228