2

Suppose I have a table like this

id  data
1   0001
2   1000
3   2010
4   0120
5   0020
6   0002

sql fiddle demo

id is primary key, data is fixed length string where characters could be 0, 1, 2.

Is there a way to build an index so I could quickly find strings which are differ by n characters from given string? like for string 0001 and n = 1 I want to get row 6.

Thanks.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197

1 Answers1

2

There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:

SELECT *
FROM   a
WHERE  levenshtein(data, '1110') = 1;

SQL Fiddle.

But it is not very fast. Slow with big tables, because it can't use an index.

You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index as detailed in the linked manual pages. I did not get anywhere, the module is using a different definition of "similarity".

Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.

If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:

SELECT *
FROM   a
WHERE  data <> '1110'
AND   (data LIKE '_110' OR
       data LIKE '1_10' OR
       data LIKE '11_0' OR
       data LIKE '111_');

Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.

However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.

I tested with 10k and 100k rows with and without a trigram GIN index. Since ~ 19% match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5 % of the rows (depends), a query using an index is (much) faster.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228