47

I have a little problem with search functionality on my RoR based site. I have many Produts with some CODEs. This code can be any string like "AB-123-lHdfj". Now I use ILIKE operator to find products:

Product.where("code ILIKE ?", "%" + params[:search] + "%")

It works fine, but it can't find product with codes like "AB123-lHdfj", or "AB123lHdfj".

What should I do for this? May be Postgres has some string normalization function, or some other methods to help me?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Alve
  • 1,315
  • 2
  • 17
  • 16

2 Answers2

64

Postgres provides a module with several string comparsion functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.

Example:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.

Try this query sample: (with your own object names and data of course)

SELECT * 
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10

This query says:

Give me the top 10 results of all data from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is less than 3. You will get back all rows where the value of code is within 3 characters difference to 'AB123-lHdfj'...

Note: if you get an error like:

function levenshtein(character varying, unknown) does not exist

Install the fuzzystrmatch extension using:

test=# CREATE EXTENSION fuzzystrmatch;
David Wolever
  • 148,955
  • 89
  • 346
  • 502
Paul Sasik
  • 79,492
  • 20
  • 149
  • 189
  • Can you please show me some simple example of this function using? I have no idea how it can help me :) – Alve Oct 12 '11 at 09:29
  • I added a simple, hard-coded query sample that should be enough to get you started. Don't forget to make sure that the fuzzystrmatch module is available to you: http://www.postgresonline.com/journal/categories/57-fuzzystrmatch – Paul Sasik Oct 12 '11 at 13:38
  • Levenshtein is an expensive function, is this running Levenshtein twice per call? – triunenature Nov 22 '15 at 01:16
  • @triunenature How many times it gets called will actually depend on the optimization. If the optimizer is smart enough it'll call once and cache the results. I should also point out that even if it isn't optimized that the call's complexity is linear and not exponential so the performance impact isn't nearly as bad. – Paul Sasik Nov 22 '15 at 01:26
  • 1
    Thanks for hint with need of installing `fuzzystrmatch` for using `levenshtein` in postgresql. It helped me to get rid of `HINT: No function matches the given name and argument types. You might need to add explicit type casts` – andilabs Feb 09 '17 at 10:49
  • @erwin-brandstetter mentions that levenshtein is quite slow and that it is better to use pg_trgm. Will this still be true when using https://www.postgresql.org/docs/9.0/static/indexes-expressional.html for levenshtein? – velop Mar 26 '17 at 17:19
  • 1
    **NOTE!** there is a `levenshtein_less_equal()` as well! It is an accelerated version of the Levenshtein function for use when only **small** **distances** are of interest. If the actual distance is less than or equal to max_d, then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater than max_d. If max_d is negative then the behavior is the same as levenshtein. `test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);` – simUser Nov 21 '19 at 13:09
52

Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row. That's expensive and cannot use an index. The "accelerated" variant levenshtein_less_equal() is faster for long strings, but still slow without index support.

If your requirements are as simple as the example suggests, you can still use LIKE. Just replace any - in your search term with % in the WHERE clause. So instead of:

WHERE code ILIKE '%AB-123-lHdfj%'

Use:

WHERE code ILIKE '%AB%123%lHdfj%'

Or, dynamically:

WHERE code ILIKE '%' || replace('AB-123-lHdfj', '-', '%') || '%'

% in LIKE patterns stands for 0-n characters. Or use _ for exactly one character. Or use regular expressions for a smarter match:

WHERE code ~* 'AB.?123.?lHdfj'

.? ... 0 or 1 characters

Or:

WHERE code ~* 'AB\-?123\-?lHdfj'

\-? ... 0 or 1 dashes

You may want to escape special characters in LIKE or regexp patterns. See:


If your actual problem is more complex and you need something faster then there are various options, depending on your requirements:

Overview of pattern-matching techniques:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228