0

Here's the issue I am having. I want to return rows that are similiar from my table in my database. Consider two rows with the following text:

'Cisco phones cannot dial out'
'Phones are not working for outgoing calls'

These are two different rows...I tried to do something like this:

DECLARE @TheTest varchar(1000)
DECLARE @TheResult varchar(1000)

SET @TheTest = ('Cisco phones cannot dial out')
SET @TheResult = ('Phones are not working for outgoing calls')

CREATE TABLE #Test
(
MyCol varchar(1000)
)

INSERT INTO #Test(MyCol)
SELECT @TheResult

SELECT * FROM #Test WHERE LOWER(MyCol) LIKE '%' + LOWER(@TheTest) + '%'

DROP TABLE #Test

But the result is 0 rows were returned, well I understand that because the string TheTest is not close enough to the string @TheResult...But I need a solution that would actually return this row because the word phones appear in both texts.

Would I have to build something much more elaborate to seperate words and get rid of common words for something like this? I'd like to mimic functionality that I see on a specific website:

enter image description here

oJM86o
  • 2,108
  • 8
  • 43
  • 71
  • Is one word enough? So if the text you want to find contains an `I` as word and the text you want to search also contains an `I`, should the record be returned? – Tim Schmelter Oct 24 '12 at 14:09
  • If I understand correctly, you're looking for any result where *any word* from `TheTest` is present. Are you sure that's what you want? What about words like "the" or "out" that will appear commonly in many scenarios? – Dan Puzey Oct 24 '12 at 14:10
  • @TimSchmelter Yes that would be enough. – oJM86o Oct 24 '12 at 14:10
  • @DanPuzey I guess thats the issue, would this have to be another table to enter all the common words to then exclude them? – oJM86o Oct 24 '12 at 14:11
  • 2
    It sounds like the question is "How do I write Google?" – cadrell0 Oct 24 '12 at 14:13
  • @cadrell0 - Please people I am not a newbie and understand what I am asking. I attached an image of a site that does something very similiar. I typed in `i like to test` and it returned similiar records. I want to mimic this type of functionality... – oJM86o Oct 24 '12 at 14:18
  • 1
    It's not really clear what you want, even with your added example. Your 'phones' example seems to be about [finding common substrings](http://stackoverflow.com/questions/1410822/how-can-i-detect-common-substrings-in-a-list-of-strings) but your 'test' question is about fuzzy matching and here [full-text search](http://msdn.microsoft.com/en-us/library/ms142571.aspx) might be useful, depending on exactly what you want. – Pondlife Oct 24 '12 at 14:21
  • I guess fuzzy matching is what I would be looking for. – oJM86o Oct 24 '12 at 14:22
  • 2
    @oJM86o That is a very complicated thing to do. First you have to break the search query into individual words. Then you have to handle variations on those words. ex. test, testing, tested, tests. Tenses for verbs, this gets difficult when the verb changes based on tense. Now you have to look at similar words. Telephone and phone. What about can't, can not and cannot. Finally, you have to weight each word. Is, as, the, a probably matter less than phone, dial, and call. You then have to figure out how to compare each row in the database to this and weight the results. – cadrell0 Oct 24 '12 at 14:24
  • @cadrell0 - But at the end of the day people are doing this, is there a simple way to do this or is it just pretty hefty work, it sounds like a lot... – oJM86o Oct 24 '12 at 14:25
  • 1
    @oJM86o That is the point I am trying to make. This is a lot of work and the people that do this specialize in it. – cadrell0 Oct 24 '12 at 14:29
  • I agree with @cadrell0. This will require a custom function that removes articles, identifies and compare key words, etc. You can probably write up a simple function to do this, but to do that I recommend looking at text samples, making statistical summaries of those text examples, and so on. – PowerUser Oct 24 '12 at 14:37

3 Answers3

2

From what I understand of your question not even SQL FullText is going to get what you want.
You are looking for Lucene type features.
I think they use Lucene on SO for finding like questions.

A short cut is to parse the words and then take the stem (Porter) and populate a table.
Just using Regex on a word break is a start.
Or you can skip the stemming but that will miss a lot of matches (e.g. match).
Only index unique words / stems.
If a words show up 4 times in one phrase and 2 times in the other it is problematic to score.
And I think a single match per word is a more meaningful score.
Then do a join and count the number of words joined.
Need to normalize as 12 matches on 12 words is better than 14 matches on 20 words.
Like 2 * matches / (word count A + word count B).

The other direction is to not stem but use a fuzzy match like a Levenstein distance.

In your sample only phone matches so that would have a low score.
But even a Lucene or Google would have trouble giving that a high score.
Breaking down the English language for two phrases that have different words but mean the same is very complex.

paparazzo
  • 44,497
  • 23
  • 105
  • 176
0

You might want to try SOUNDEX to return the similarity between two strings : http://msdn.microsoft.com/en-us/library/ms187384.aspx

It returns a varchar that holds the similarity between two strings. You can then evaluate if the varchar is in an acceptable "similarity range" that you defined.

Dominic Goulet
  • 7,983
  • 7
  • 28
  • 56
  • 1
    `SOUNDEX()` is an algorithm designed for [finding names that sound similar](http://en.wikipedia.org/wiki/Soundex) (hence the name), it is definitely not a general-purpose text comparison function. What the OP is looking for is much more complicated. – Pondlife Oct 24 '12 at 14:16
  • Unfortunately `SOUNDEX` and `DIFFERENCE` will not work for me. – oJM86o Oct 24 '12 at 14:19
0

One way would be to use a Split function(here's a demo):

CREATE FUNCTION [dbo].[Split]
(
    @ItemList NVARCHAR(MAX), 
    @delimiter CHAR(1)
)
RETURNS @IDTable TABLE (Item VARCHAR(50))  
AS      

BEGIN    
    DECLARE @tempItemList NVARCHAR(MAX)
    SET @tempItemList = @ItemList

    DECLARE @i INT    
    DECLARE @Item NVARCHAR(4000)

    SET @i = CHARINDEX(@delimiter, @tempItemList)

    WHILE (LEN(@tempItemList) > 0)
    BEGIN
        IF @i = 0
            SET @Item = @tempItemList
        ELSE
            SET @Item = LEFT(@tempItemList, @i - 1)
        INSERT INTO @IDTable(Item) VALUES(@Item)
        IF @i = 0
            SET @tempItemList = ''
        ELSE
            SET @tempItemList = RIGHT(@tempItemList, LEN(@tempItemList) - @i)
        SET @i = CHARINDEX(@delimiter, @tempItemList)
    END 
    RETURN
END  

Now you can check if one of the words is contained in the column:

DECLARE @TheTest varchar(1000)
DECLARE @TheResult varchar(1000)

SET @TheTest = ('Cisco phones cannot dial out')
SET @TheResult = ('Phones are not working for outgoing calls')

CREATE TABLE #Test
(
MyCol varchar(1000)
)

INSERT INTO #Test(MyCol)
    SELECT @TheResult

Declare  @searchWords Table(Item varchar(100));
INSERT INTO @searchWords
    SELECT Item FROM dbo.Split(@TheTest, ' ');

SELECT * FROM #Test t
WHERE EXISTS
(
    SELECT 1 
    FROM dbo.Split(t.MyCol, ' ')cw INNER JOIN @searchWords sw
        ON cw.Item = sw.Item
);

DROP TABLE #Test 
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • Yep I guess you are right in your comment 1 word may be just too much....Appreciate your response though. – oJM86o Oct 24 '12 at 14:24