2

I have a SQL server table which contains several string columns. I need to write an application which gets a string and search for similar strings in SQL server table.

For example, if I give the "مختار" or "مختر" as input string, I should get these from SQL table:

1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری

I've searched the net for a solution but I have found nothing useful. I've read this question , but this will not help me because:

  1. I am using MS SQL Server not MySQL
  2. my table contents are in Persian, so I can't use Levenshtein distance and similar methods
  3. I prefer an SQL Server only solution, not an indexing or daemon based solution.

The best solution would be a solution which help us sort result by similarity, but, its optional.

Do you have any suggestion for that?

Thanks

Community
  • 1
  • 1
Shahab
  • 794
  • 2
  • 12
  • 23

6 Answers6

4

MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?

SELECT * FROM table WHERE input LIKE '%مختار%'
keyboardP
  • 68,824
  • 13
  • 156
  • 205
3

Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?

--This part searches for a string you want

declare @MyString varchar(max)

set @MyString = (Select column from table
where **LOGIC TO FIND THE STRING GOES HERE**)


--This part searches for that string

select searchColumn, ABS(Len(searchColumn) - Len(@MyString)) as Similarity
from table where data LIKE '%' + @MyString + '%'
Order by Similarity, searchColumn

The similarity part is something like the thing you posted. If the strings are "more similar" meaning that they have a similar length, they will be higher on the results query. The absolute part can be avoided obviously but I did it just in case.

Hope that helps =-)

Gaspa79
  • 5,488
  • 4
  • 40
  • 63
  • Yes, its closer to what I'm looking for. But considering the length as similarity factor is not a good idea. for example "test" and "find" are not similar at all, but their length are equal. – Shahab Dec 27 '11 at 09:43
  • But wait a second, the query that I wrote it won't provide "find" as a result if you search for "test". If you search for "test" and there's "test" "find" "tester" and "testing" in your database, the result will be: test tester testing In that order. If you search for "find" instead only 1 result will be provided (find). If you search for "in" two results will be provided: find testing (in that order, due to the length stuff). I don't know persian, but if you are using an nvarchar column the result will be the same regardless the language you use. – Gaspa79 Dec 28 '11 at 04:34
  • Oh, yes, sorry, I didn't see the `where` clause. But it still don't solve my problem, because I want _find_ to be a match for _fnd_. – Shahab Dec 28 '11 at 08:41
  • Oooh, okay then. I didn't understand that. Here's the thing, first of all, if you want to do this using LIKE, it'll be a performance killer. Well, actually, if you do this for large character strings it'll be a performance killer anyway due to the immense posibilities of a word containing a couple of characters.Anyway, you first need to make an index in your search column (right click table in obj explorer, full text indexes). After that, you'll be able to use the function CONTAINS. The sintax will be: select SearchColumn from Table where contains(SearchColumn, ' "test" '). (Continues down) – Gaspa79 Dec 29 '11 at 02:59
  • That thing above will do the same thing as the LIKE function using "test". Now, if you wanna add wildcards the sintax is like this: contains(SearchColumn, ' "*test*" '). That's right, the * in Contains is the same as % in LIKE. Knowing that, you will now need a function that returns a table according to your search needs given a character string. For example, if you give the function "test", it should return a table with *test*,*t*est*,*te*st*,*tes*t*. You could go on with the wildcards but your query will take a lot of time for large tables and strings. (Continues down) – Gaspa79 Dec 29 '11 at 03:03
  • Sidenote: I Just noticed that the asterisks ( this character -> *) turns stuff into italics, oh well, whenever you see a word surrounded in italics it means that it has asterisks surrounded, meaning that if you replace the % with an asterisk here you get the first contain I intended to write : contains(SearchColumn, ' "%test%" '). Anyway, after you get that function that returns a table, you will now need a stored proc that generates a dynamic sql using each row of that table in a contains statement. Do you understand? If there's any easier way to do this, I'm not aware of it. Sorry! =( – Gaspa79 Dec 29 '11 at 03:10
  • If this solves it please answer or accept my answer because I really wanna know if you did this the way I thought it =P. If you don't, please tell me how you did it =-) – Gaspa79 Dec 29 '11 at 03:13
  • thanks for your help :) Its a working method, but its far away from what I'm looking for. Its a performance killer, and it could not sort data by their relevance. – Shahab Dec 31 '11 at 09:25
1

Besides like operator, you can use the condition WHERE instr(columnname, search) > 0; however this is generally slower. What it does is return the starting position of a string within another string. thus if searching in ABCDEFG for CD it would return 3. 3>0, so the record would be returned. However in the case you've described, like seems to be the best solution.

xQbert
  • 34,733
  • 2
  • 41
  • 62
  • There is no `instr` function in SQL Server. Perhaps you meant `SUBSTRING` also (authoratitive) citation needed for "generally slower". There is no obvious reason why performance should be different if the `LIKE` expression has a leading wildcard. Both have to do the same work. – Martin Smith Dec 26 '11 at 15:56
  • Thanks for your reply, but `LIKE` and `SUBSTRING` are not what I'm looking for. I want to match "مختر" for them too! – Shahab Dec 27 '11 at 09:39
1

The general problem is that in languages where the same letter has different writing form in the beginning, middle and at the end of word, and thus - different codes - we can try to use specific Persian collations, but in general this will not help.

The second option - is to use SQL FTS abilities, but again - if it has not special language module for the language - it is much less useful.

And most general way - to use your own language processing - which is very complex task at all. The next keywords and google can help to understand the size of the problem: DLP, words and terms, bi-gramms, n-gramms, grammar and morphology inflection

Oleg Dok
  • 21,109
  • 4
  • 45
  • 54
  • Thanks Oleg, but I think FTS could help, because Persian alphabet is too similar with Arabic alphabet and FTS should support Arabic. But I have no idea on how to use FTS to solve this problem. :( – Shahab Dec 27 '11 at 09:45
  • You want me to fullfil the answer with example of using Fts? – Oleg Dok Dec 27 '11 at 10:08
  • 1
    No, I tried your strings with FTS' Arabic module - it simply not working, sorry. So - welcome to the world of Natural Language Processing 8-) – Oleg Dok Dec 27 '11 at 11:26
  • Oh...GOD :( Can you give me an example of using FTS with arabic module please? – Shahab Dec 27 '11 at 12:02
0

Try to use the Built-in Soundex() And Difference() functions. I hope they work fine for Persian.

Look at the following reference: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/

Similarity() function helps you to sort result by similarity (as you asked in your question) and it is also possible using algorithms different from Levenshtein edit distance depends on the Value for @method Algorithm:

0 The Levenshtein edit distance algorithm

1 The Jaccard similarity coefficient algorithm

2 A form of the Jaro-Winkler distance algorithm

3 Longest common subsequence algorithm

Alexander
  • 7,484
  • 4
  • 51
  • 65
0

Like operator may not do what he is asking for. Like for example, if i have a record value "please , i want to ask a question' in my database record. and lets say on my query, i want to find a match similarity like this 'Can i ask a question, please'. like operator may do this using like %[your senttence] or [your sentence]% but it is not advisable to use it for string similarity cos sentences may change and all your like logic may not fetch the matching records. It is advisable to use naive bayes text classification for similarities assigning labels to your sentences or you can try the semantic search function in MSSQL server

  • you should re-format your answer. This way it is completely unreadeable and therefor not helpful to the problem. – TheMixy May 10 '21 at 14:57