How to create a query that consider a match even if it has two different letters

Question

I want some technique to solve my problem. I am using Postgresql 10. I want to create a query that select substrings even if there's a difference of two letters. My language, portuguese, has many letters with same sound and people really make mistakes when typing. Roughly what I want to do is this.

if I have a product name with the name "automóvel grande"

atomóvel grande" would match  //one letter is missing 

 automovel grnde" would match //lack accent and letter is missing

  automovell grnde would match // lack of accent and the "l" was repeated

I used pg_trgm sometime ago and in my memory it worked, but, now, that I am using it again I am really getting some weird results.

I am using a setting like this:

ALTER DATABASE mydb SET pg_trgm.word_similarity_threshold = 0.2;

It is not able to match "capacete" when typing "capoceti".

I think the two words are similar enough. Chaning the threshold doesn't affect the result.

*"doesn't affect the result"* What result? You did not show any expression or query. — Erwin Brandstetter, Apr 26 '19 at 14:58

issuela · Answer 1 · 2019-04-26T13:23:14.110

4

I would suggest using Levenshtein Distance (https://en.wikipedia.org/wiki/Levenshtein_distance). It is the number of operations you need to change one string to another including insertion, deletion or changing a letter by another.

First, create the extension by running this:

CREATE EXTENSION fuzzystrmatch;

Then, you can use levenshtein (string1, string2) to find the distance (integer) between string1 and string2. After you run levenstein for your columns, filter the ones with a distance of 2 or less.

In your case, try this:

Select levenshtein('capacete', 'capoceti')

and you will get the distance of 2.

For more info on the string similarities on postgres see here: https://www.postgresql.org/docs/9.1/fuzzystrmatch.html

edited Apr 26 '19 at 13:23

answered Apr 26 '19 at 13:15

issuela

97
5

I think I've used levenshtein not pg_trgm. Do you know how to use it to work with substrings. I've done it before but I don't remember. – Apr 26 '19 at 15:39
Would you like to calculate between specific substrings or all substrings of two strings? If the second, maybe refer to this: https://stackoverflow.com/questions/8139958/algorithm-to-find-edit-distance-to-all-substrings . For the first, just use substring(string, firstposition, lastposition). For example, select substring('capacete', 1, 4) will give you 'capa'. – issuela Apr 30 '19 at 15:07

How to create a query that consider a match even if it has two different letters

1 Answers1