2

I want some technique to solve my problem. I am using Postgresql 10. I want to create a query that select substrings even if there's a difference of two letters. My language, portuguese, has many letters with same sound and people really make mistakes when typing. Roughly what I want to do is this.

if I have a product name with the name "automóvel grande"

atomóvel grande" would match  //one letter is missing 

 automovel grnde" would match //lack accent and letter is missing

  automovell grnde would match // lack of accent and the "l" was repeated

I used pg_trgm sometime ago and in my memory it worked, but, now, that I am using it again I am really getting some weird results.

I am using a setting like this:

ALTER DATABASE mydb SET pg_trgm.word_similarity_threshold = 0.2;

It is not able to match "capacete" when typing "capoceti".

I think the two words are similar enough. Chaning the threshold doesn't affect the result.

1 Answers1

4

I would suggest using Levenshtein Distance (https://en.wikipedia.org/wiki/Levenshtein_distance). It is the number of operations you need to change one string to another including insertion, deletion or changing a letter by another.

First, create the extension by running this:

CREATE EXTENSION fuzzystrmatch;

Then, you can use levenshtein (string1, string2) to find the distance (integer) between string1 and string2. After you run levenstein for your columns, filter the ones with a distance of 2 or less.

In your case, try this:

Select levenshtein('capacete', 'capoceti') 

and you will get the distance of 2.

For more info on the string similarities on postgres see here: https://www.postgresql.org/docs/9.1/fuzzystrmatch.html

issuela
  • 97
  • 5
  • I think I've used levenshtein not pg_trgm. Do you know how to use it to work with substrings. I've done it before but I don't remember. –  Apr 26 '19 at 15:39
  • Would you like to calculate between specific substrings or all substrings of two strings? If the second, maybe refer to this: https://stackoverflow.com/questions/8139958/algorithm-to-find-edit-distance-to-all-substrings . For the first, just use substring(string, firstposition, lastposition). For example, select substring('capacete', 1, 4) will give you 'capa'. – issuela Apr 30 '19 at 15:07