Find nearest matching strings and its substrings using SQL query

Question

I want to write a query that will fetch nearest matching strings of given string and its sub-strings in that order.

For example, lets say am having table of all names in a column. If I want to search name "ATUL", then results should list all distinct names matching first "ATUL%" then "ATU%" then "AT%" and then "A%" and finally all remaining records.

(Then I am going to pick up first N records out of it based on my needs)

Distinct union of queries is one solution I can think of. Is there any more efficient way to do this?

UPDATE:

Thanks for answers below. Meanwhile I was trying something on my own and found this query producing expected results, provided I have username column indexed

select * FROM all_usernames WHERE (username LIKE 'atul%') or (username LIKE 'atu%') or (username LIKE 'at%') or (username LIKE 'a%') or (username LIKE '%');

But is it standard behaviour or is it that I am just getting it coincidently?

Wouldn't you need something like `"OTUL"` too, though? I might be wrong, but it sounds like you want to find strings with minimum Hamming distance, not just the same initial characters. — Timekiller, Dec 19 '15 at 17:31
Only sub-strings generated by eliminating last character one by one. — Atul, Dec 19 '15 at 17:34
Looks like you want calculate Levenstein distance between strings. Look at question: http://stackoverflow.com/questions/634995/implementation-of-levenshtein-distance-for-mysql-fuzzy-search and implementation: http://www.artfulsoftware.com/infotree/queries.php#552 — Alex Yu, Dec 20 '15 at 09:51

score 2 · Accepted Answer · answered Dec 19 '15 at 18:16

2

One method is to use like in the order by:

order by (case when name like 'ATUL%' then 1
               when name like 'ATU%' then 2
               when name like 'AT%' then 3
               when name like 'A%' then 4
               else 5
          end)

A more generic method is also brute force, but could go something like this:

order by (case when left(name, 9) = left('ATUL', 9) then 1
               when left(name, 8) = left('ATUL', 8) then 2
               when left(name, 7) = left('ATUL', 7) then 3
               when left(name, 6) = left('ATUL', 6) then 4
               when left(name, 5) = left('ATUL', 5) then 5
               when left(name, 4) = left('ATUL', 4) then 6
               when left(name, 3) = left('ATUL', 3) then 7
               when left(name, 2) = left('ATUL', 2) then 8
               when left(name, 1) = left('ATUL', 1) then 9
          end)

answered Dec 19 '15 at 18:16

Gordon Linoff

1,242,037
58
646
786

Thanks for this answer. Meanwhile I was trying something on my own and found `select * FROM all_usernames WHERE (username LIKE 'atul%') or (username LIKE 'atu%') or (username LIKE 'at%') or (username LIKE 'a%') or (username LIKE '%');` returns me expected results provided I have _username_ column **indexed** But is it standard behaviour or is it that I am just getting it coincidently? – Atul Dec 19 '15 at 19:59
@Atul . . . I actually considered phrasing the answer that way at first, but I thought the first version was clearer. In general, indexes affect the *performance* of queries. However, indexes do not affect the *semantics* of queries. That is, with or without an index, the query should do the same thing. – Gordon Linoff Dec 22 '15 at 02:14
Thanks for clarification. With this comment it made a complete answer for me :) – Atul Dec 22 '15 at 04:21

score 0 · Answer 2 · answered Dec 19 '15 at 19:27

Well, ATUL%, ATU% and AT% are all subsets of A%, so it's sufficient to select A% to get all your results. The tricky part is ordering them by how many first characters match. There seems to be no easy or elegant way to find this out, so if you want something generic you'll have to write your own function that compares substrings of string 1 and string 2 in a loop until they differ or length of either strings is reached, something like this:

CREATE FUNCTION `compare_first_chars`(str1 varchar(1000), str2 varchar(1000))
  RETURNS int
  DETERMINISTIC
BEGIN
  DECLARE v_offset INT;
  DECLARE v_minlen INT;

  IF str1 is null or str2 is null THEN
    return 0;
  END IF;

  SET v_offset = 0;

  SET v_minlen = least(length(str1), length(str2));

  count_loop: LOOP
    SET v_offset = v_offset + 1;

    IF v_offset > v_minlen THEN
      LEAVE count_loop;
    END IF;

    IF substr(str1, 1, v_offset) != substr(str2, 1, v_offset) THEN
      LEAVE count_loop;
    END IF;

  END LOOP;

  RETURN v_offset-1;
END

then you can order desc by it. If you don't need something that complex, then either use CASE in order, or distinct union as you mentioned in answer.

Find nearest matching strings and its substrings using SQL query

2 Answers2

Linked