How to find similar results in SQL Server?

Question

I have these items in my DB:

Harry Potter and the Chamber of Secrets
Harry Potter and the Deathly Hallows: Part 1
Harry Potter and the Deathly Hallows: Part 2
Harry Potter and the Goblet of Fire
Harry Potter and the Half-Blood Prince
Harry Potter and the Order of the Phoenix
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Sorcerer's Stone
And other movies...

How can i return above items, when user search for 'hary poter' (not 'harry potter')?

You might try using a [soundex](https://msdn.microsoft.com/en-us/library/ms187384.aspx) -- http://stackoverflow.com/questions/1923394/use-soundex-word-by-word-on-sql-server — Paul Abbott, Sep 22 '16 at 22:18
Unfortunately for loose matches, it becomes harder. A search for "deadly hallow", for example, won't work with just a plain SOUNDEX() on the entire search term. You would need to break up the words in the search term, break up the words in each of the titles, and do a SOUNDEX() on the individual terms. This could be pretty taxing, depending on the size of your table — ZLK, Sep 22 '16 at 22:41
@ZLK Is there any example for doing that? May i use cursor to implement it? — Amir, Sep 22 '16 at 22:45
It's achievable, it's just not so easy to guarantee matches on weird strings. I might post an example as an answer. — ZLK, Sep 22 '16 at 23:11

score 1 · Accepted Answer · answered Sep 22 '16 at 23:28

It's difficult to get something that works really well for this sort of thing in SQL Server. Fuzzy matches are really hard to work with when you need to search for spelling mistakes while trying not to get bad matches on things.

For example, the following is one way you could try to do this:

DECLARE @ TABLE (id INT IDENTITY(1, 1), blah NVARCHAR(255));

INSERT @ VALUES ('Harry Potter and the Chamber of Secrets')
,('Harry Potter and the Deathly Hallows: Part 1')
,('Harry Potter and the Deathly Hallows: Part 2')
,('Harry Potter and the Goblet of Fire')
,('Harry Potter and the Half-Blood Prince')
,('Harry Potter and the Order of the Phoenix')
,('Harry Potter and the Prisoner of Azkaban')
,('Harry Potter and the Sorcerer''s Stone');

DECLARE @myVar NVARCHAR(255) = 'deadly halow'; -- returns 2 matches (both parts of Deathly Hallows)
-- SET @myVar = 'hary poter'; -- returns 8 matches, all of them
-- SET @myVar = 'order'; -- returns 1 match (Order of the Phoenix)
-- SET @myVar = 'phoneix'; -- returns 2 matches (Order of the Phoenix and Half-blood Prince, the latter due to a fuzzy match on 'prince')

WITH CTE AS (
    SELECT id, blah
    FROM @
    UNION ALL
    SELECT 0, @myVar
    )
, CTE2 AS (
    SELECT id
         , blah
         , SUBSTRING(blah, 1, ISNULL(NULLIF(CHARINDEX(' ', blah), 0) - 1, LEN(blah))) individualWord
         , NULLIF(CHARINDEX(' ', blah), 0) cIndex
         , 1 L
    FROM CTE
    UNION ALL 
    SELECT CTE.id
         , CTE.blah
         , SUBSTRING(CTE.blah, cIndex + 1, ISNULL(NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0) - 1 - cIndex, LEN(CTE.blah)))
         , NULLIF(CHARINDEX(' ', CTE.blah, cIndex + 1), 0)
         , L + 1
    FROM CTE2
    JOIN CTE ON CTE.id = CTE2.id
    WHERE cIndex IS NOT NULL
    )
SELECT blah
FROM (
    SELECT X.blah, ROW_NUMBER() OVER (PARTITION BY X.ID, Y.L ORDER BY (SELECT NULL)) RN, Y.wordCount
    FROM CTE2 X
    JOIN (SELECT *, COUNT(*) OVER() wordCount FROM CTE2 WHERE id = 0) Y ON DIFFERENCE(X.individualWord, Y.individualWord) >= 3 AND X.id <> 0) T
WHERE RN = 1
GROUP BY blah
HAVING COUNT(*) = MAX(wordCount);

This splits each of the words in the search term, splits each of the words in the titles, then uses the DIFFERENCE() function, which compares the SOUNDEX() of the values and tells you how far apart they are. e.g. SOUNDEX('Halow') is 'H400' and SOUNDEX('Hallows') is 'H420' - the difference here is 3 (because H, 4 and one of the zeroes match). A perfect match would have a difference of 4, a close match has a difference above 3 generally.

Unfortunately, because you need to check for close matches, you get some false positives with this sometimes. I tested it with, for example, 'phoneix' as the input and got a match on 'Half-blood Prince' due to a fuzzy match between 'prince' and 'phoenix'. I'm sure there are ways this could be improved upon, but something like this should work as a basis for what you're trying to achieve.

You are genius :). It is exactly what i need. But there are 15000 movies in my db and many of them match 'hary poter'! I added DIFFERENCE(X.individualWord, Y.individualWord) in your nested SELECT and used SUM and 'Order By' to get most relevant titles. — Amir, Sep 23 '16 at 06:25
Yes. Unfortunately for fuzzy string matches, there's going to be a lot of false positives. Note that by making the difference = 4 you'll get a lot less false positives but the further off someone is from the actual title, the less likely they'd be to get results. I think your idea of ordering by how close it is probably works about as well as you could hope in SQL, short of putting a lot of effort into making a custom version of soundex — ZLK, Sep 23 '16 at 07:02

score 0 · Answer 2 · answered Sep 22 '16 at 22:52

You can use this query

create table #test (v varchar(50) )

insert into #test (v) values
 ('Harry Potter and the Chamber of Secrets'       )
,('Harry Potter and the Deathly Hallows: Part 1'  )
,('Harry Potter and the Deathly Hallows: Part 2'  )
,('Harry Potter and the Goblet of Fire'           )
,('Harry Potter and the Half-Blood Prince'        )
,('Harry Potter and the Order of the Phoenix'     )
,('Harry Potter and the Prisoner of Azkaban'      )
,('Harry Potter and the Sorcerer''s Stone'        )


select * from #test 
where PATINDEX('%[Hh]%ar[r]%y [pP]%ot[t]%er%', v)>0

How to find similar results in SQL Server?

2 Answers2