
I have a large number of Unicode strings that I want to store in a MySQL database. I also want to add an extra field that represents the character identity of each string. For example:

String                    Key
------                    ---------
this is 1st string        113547858
this is first string      113547865
I go to school            524872354

As you may have noticed, the first two keys are close to each other, reflecting the similarity of the strings, whereas the third one is far from both.

I don't want to use PHP's similar_text or levenshtein, because they need two strings to compare. Instead, I want to compute a single value for each string and store it in the DB, so that I can put an index on it for future use.

halfer
Peyman Mohamadpour
  • Have a look at http://stackoverflow.com/questions/5351659/algorithms-for-string-similarities-better-than-levenshtein-and-similar-text – Haim Evgi Jun 22 '14 at 06:44
  • You're basically asking for a digest algorithm which nonetheless keeps certain characteristics of the input intact, yet supposedly ensures uniqueness. I'm not sure such a thing exists (though I'm no expert in that field). The original string is its own best unique-with-similarity representation already. What is your use case for this? Maybe there's a less esoteric way to solve your underlying problem. – deceze Jun 22 '14 at 08:34
  • @HaimEvgi Thanks, but it does not solve my problem, as I don't want to compare two strings – Peyman Mohamadpour Jun 22 '14 at 17:12
  • @deceze I want to store it in the DB, check each new post for possible duplicates, and present them to the admin – Peyman Mohamadpour Jun 22 '14 at 17:12
  • And why not put an index on the column and use string comparisons to search for duplicates? – deceze Jun 22 '14 at 17:27
  • @deceze Because that way I would have to compare each new post's content with the content of every existing post – Peyman Mohamadpour Jun 24 '14 at 10:15

1 Answer


Could a simple summation of the character codes of all characters in the string be a solution?
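A minimal Python sketch of that idea, assuming "character code" means the Unicode code point of each character; the function name `char_code_sum` is made up for illustration. Note that the sum ignores character order, so rearranged strings get the same key.

```python
def char_code_sum(text: str) -> int:
    """Return the sum of the Unicode code points of all characters in text."""
    return sum(ord(ch) for ch in text)


# Print a key for each example string from the question.
for s in ("this is 1st string", "this is first string", "I go to school"):
    print(s, char_code_sum(s))
```

The resulting integer could be stored in an indexed column next to the string.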

Update:

Summing a hash value computed for each word of the string could also be a solution.
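A rough sketch of the word-level variant in Python. The answer does not name a hash function, so the choice of CRC32 (via `zlib.crc32`), the lowercasing, the whitespace split, and the 32-bit wrap-around are all assumptions made for illustration, as is the name `word_hash_sum`.

```python
import zlib


def word_hash_sum(text: str) -> int:
    """Sum a per-word CRC32 over all words, wrapped to 32 bits."""
    total = 0
    for word in text.lower().split():  # assumption: case-insensitive, split on whitespace
        total += zlib.crc32(word.encode("utf-8"))
    return total % 2**32  # keep the key within an unsigned 32-bit column


print(word_hash_sum("I go to school"))
```

With this scheme, strings made of the same words in any order share a key, but changing a single word generally moves the key to an unrelated value, so numerically close keys do not indicate similar strings.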

mmonem
  • Does it cover similarity? I need similar strings to be detected as duplicates – Peyman Mohamadpour Mar 19 '17 at 08:33
  • To some extent, yes. – mmonem Mar 19 '17 at 18:25
  • But, of course, many strings would have the same sum. In this case the code you assign to each string represents just a **Possibility of Similarity**. When you sort or filter on this code field, you can then run further checks to confirm similarity. This can save a lot of time compared with applying your similarity algorithm to the whole dataset – mmonem Mar 19 '17 at 18:34
  • But it can work for real strings such as names and addresses. Again, that code only gives you a clue of similarity and requires further checking, as I said – mmonem Mar 20 '17 at 04:52
  • You are absolutely wrong, as there are so many [**Same Letters Different Words**](http://www.litscape.com/word_tools/anagram.php) like `caress` & `scares` or `paste` & `peats` & `septa` & `spate` & `tapes` for example. So this is not a solution at all – Peyman Mohamadpour Apr 05 '17 at 08:11
  • First, I am talking about probability. What is the probability of having such words in real data? Second, I mentioned that it can give you just a **possibility**, which means that you may need to search those **short-listed** sentences that caused this kind of collision to check whether they are really similar or not. Anyway, you know your problem better than me – mmonem Apr 07 '17 at 04:16
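The two-step approach discussed in these comments (shortlist by key, then verify) might be sketched in Python as follows. Everything here is illustrative: `find_possible_duplicates`, the `tolerance` of 300, and the `min_ratio` of 0.8 are made-up names and thresholds, the `stored` list stands in for rows fetched with a range filter on the indexed key column, and `difflib.SequenceMatcher` is only one possible second-stage check.

```python
from difflib import SequenceMatcher


def char_code_sum(text: str) -> int:
    """Return the sum of the Unicode code points of all characters in text."""
    return sum(ord(ch) for ch in text)


def find_possible_duplicates(new_text, stored, tolerance=300, min_ratio=0.8):
    """Return stored texts that pass both the key filter and a real comparison.

    stored is a list of (key, text) pairs, as they might come back from the DB.
    """
    new_key = char_code_sum(new_text)
    # Step 1: cheap shortlist on the precomputed key (an indexed range scan in SQL).
    shortlist = [text for key, text in stored if abs(key - new_key) <= tolerance]
    # Step 2: pairwise comparison, but only against the shortlist.
    return [
        text
        for text in shortlist
        if SequenceMatcher(None, new_text, text).ratio() >= min_ratio
    ]


rows = [(char_code_sum(t), t) for t in ("this is 1st string", "I go to school", "paste")]
print(find_possible_duplicates("this is first string", rows))
print(find_possible_duplicates("tapes", rows))
```

The second call illustrates the anagram objection: `paste` and `tapes` share a key, so the pair reaches step 2, where the pairwise comparison rejects it.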