Optimized search with PostgreSQL
Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.
If it gets more fuzzy or your search is not anchored at the start, look here for more:
In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular, look at operator classes and indexes on expressions.
1. text_pattern_ops
Assuming your column is of type text, you would use a special index for text pattern operators like this:
CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);
SELECT name
FROM tbl
WHERE name ~~ ('Hambu' || '%');
This is assuming that you operate with a database locale other than C
- most likely de_DE.UTF-8
in your case. You could also set up a database with locale 'C'. I quote the manual here:
If you do use the C locale, you do not need the xxx_pattern_ops
operator classes, because an index with the default operator class is
usable for pattern-matching queries in the C locale.
Update: In modern Postgres consider COLLATE "C"
. See:
2. Index on expression
I'd imagine you would also want to make that search case insensitive. so let's take another step and make that an index on an expression:
CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(name) ~~ (lower('Hambu') || '%');
To make use of the index, the WHERE
clause has to match the the index expression.
3. Optimize index size and speed
Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:
CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%');
left()
was introduced with Postgres 9.1. Use substring(name, 1,10)
in older versions.
4. Cover all possible requests
What about strings with more than 10 characters?
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%');
AND lower(name) ~~ (lower('Hambu678910') || '%');
This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, the additional clause filters the rest. Experiment to find the sweet spot. Depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left()
effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
Update: in modern Postgres consider the ^@
operator instead. See:
5. Optimize disk representation with CLUSTER
So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx
. And you mostly read and hardly ever write. This is a textbook case for CLUSTER
. The manual:
When a table is clustered, it is physically reordered based on the index information.
With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.
First call:
CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;
Information which index to use will be saved and successive calls will re-cluster the table:
CLUSTER tbl;
CLUSTER; -- cluster all tables in the db that have previously been clustered.
If you don't want to repeat it:
ALTER TABLE tbl SET WITHOUT CLUSTER;
However, CLUSTER
takes an exclusive lock on the table. If that's a problem, look into pg_repack
or pg_squeeze
, which can do the same without exclusive lock on the table.
6. Prevent too many rows in the result
Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness, you probably do it anyway.
And LIMIT
the number of rows returned:
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT 501;
If your query returns more than 500 rows, tell the user to narrow down his search.
7. Optimize filter method (operators)
If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops
family. Like this:
SELECT name
FROM tbl
WHERE lower(left(name, 10)) ~>=~ lower('Hambu')
AND lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(2097151));
You gain very little with this last stunt. Normally, standard operators are the better choice.
If you do all that, search time will be reduced to a matter of milliseconds.