How to get a unique key from Unicode string in PHP?

Question

I have a large number of Unicode strings that I want to store in a MySQL database. I also want to add an extra field that represents the character identity of each string. For example:

String                     key
------                    -----------
this is 1st string        113547858
this is first string      113547865
I go to school            524872354

As you may have noticed, the first two keys are very close to each other, reflecting the similarity of the strings, whereas the third one is far from them.

I don't want to use PHP's similar_text or levenshtein, as they need two strings to check similarity. Instead, I want to compute a value for each individual string and store it in the DB, so that I can put an index on it for future use.

Look at http://stackoverflow.com/questions/5351659/algorithms-for-string-similarities-better-than-levenshtein-and-similar-text – Haim Evgi, Jun 22 '14 at 06:44
You're basically asking for a digest algorithm which nonetheless keeps certain characteristics of the input intact, yet supposedly ensures uniqueness. I'm not sure such a thing exists (though I'm no expert in that field). The original string is its own best unique-with-similarity representation already. What is your use case for this? Maybe there's a less esoteric way to solve your underlying problem. – deceze, Jun 22 '14 at 08:34
@HaimEvgi Thanks, but it does not solve my problem, as I don't want to compare two strings. – Peyman Mohamadpour, Jun 22 '14 at 17:12
@deceze I want to store it in the DB, check any new post for possible duplicates, and present them to the admin. – Peyman Mohamadpour, Jun 22 '14 at 17:12
And why not put an index on the column and use string comparisons to search for duplicates? – deceze, Jun 22 '14 at 17:27
@deceze Because that way I would have to compare each new post's content with the content of all existing posts. – Peyman Mohamadpour, Jun 24 '14 at 10:15

mmonem · Answer 1 · 2014-06-22T06:31:58.940

0

A simple summation of the character codes of all characters in the string could be a solution.
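
As a rough illustration of the summation idea, here is a minimal sketch. The function name `charCodeSum` is hypothetical (it does not come from the answer), and the code assumes PHP 7.2+ with the mbstring extension so that `mb_ord` is available.

```php
<?php
// Hypothetical helper, not part of the original answer:
// sums the Unicode code points of a UTF-8 string.
function charCodeSum(string $text): int
{
    // The 'u' flag makes PCRE treat the input as UTF-8, so the string is
    // split into whole characters rather than bytes. preg_split returns
    // false on invalid UTF-8; in that case fall back to an empty list.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $sum = 0;
    foreach ($chars as $char) {
        $sum += mb_ord($char, 'UTF-8');
    }
    return $sum;
}

// Anagrams contain the same characters, so they always get the same sum.
var_dump(charCodeSum('caress') === charCodeSum('scares')); // bool(true)
```

Because addition ignores order, such a key can only shortlist candidates. Equal or nearby sums do not prove that two strings are similar.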

Update:

Summing a hash value computed for each word of the string could also be a solution.
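
One possible reading of the per-word variant is sketched below. The answer does not name a hash function, so `crc32` is an assumption made here, as are the hypothetical function name `wordHashSum`, the lowercasing step, and splitting on whitespace. The sketch also assumes 64-bit PHP with mbstring, so that the sum of several 32-bit values stays an integer.

```php
<?php
// Hypothetical helper, not part of the original answer:
// sums a CRC32 hash of every whitespace-separated word.
function wordHashSum(string $text): int
{
    // Lowercase first so that "School" and "school" hash identically,
    // then split on runs of Unicode whitespace.
    $normalized = mb_strtolower($text, 'UTF-8');
    $words = preg_split('/\s+/u', $normalized, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $sum = 0;
    foreach ($words as $word) {
        $sum += crc32($word); // crc32 is an assumed choice of hash
    }
    return $sum;
}

// Word order does not affect the sum.
var_dump(wordHashSum('I go to school') === wordHashSum('school to go I')); // bool(true)
```

With a hash such as CRC32, two strings that differ in a single word generally get unrelated sums, so numeric closeness of these keys says nothing about similarity. The key is only useful for grouping strings that consist of the same words.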

edited Jun 22 '14 at 06:31

answered Jun 22 '14 at 06:26

Does it cover similarity? I need similar strings to be detected as duplicates. – Peyman Mohamadpour Mar 19 '17 at 08:33
To some extent, yes. – mmonem Mar 19 '17 at 18:25
But, of course, many strings would have the same sum. In this case the code you assign to a string represents just a **Possibility of Similarity**. When you sort or filter on this code field, you can then make further checks to be sure of similarity. This can save a lot of time compared with applying your similarity algorithm to the whole dataset. – mmonem Mar 19 '17 at 18:34
But for real strings like names and addresses, it can work. Again, that code can give you just a clue of similarity and requires further checking, as I said. – mmonem Mar 20 '17 at 04:52
You are absolutely wrong, as there are so many [**Same Letters Different Words**](http://www.litscape.com/word_tools/anagram.php) like `caress` & `scares` or `paste` & `peats` & `septa` & `spate` & `tapes`, for example. So this is not a solution at all. – Peyman Mohamadpour Apr 05 '17 at 08:11
First, I am talking about probability: what is the probability of having such words in real data? Second, I mentioned that it can give you just a **possibility**, which means that you may need to search those **short-listed** sentences that caused this kind of collision to check whether they are really similar or not. Anyway, you know your problem better than I do. – mmonem Apr 07 '17 at 04:16
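
The two-stage approach discussed in these comments (shortlist by key, then verify the shortlisted rows) might look roughly like the sketch below. Everything named here is hypothetical: `candidateKey`, `findLikelyDuplicates`, the `$tolerance` and `$minPercent` parameters, and the in-memory `$stored` array that stands in for a MySQL table with an indexed key column. Note that `similar_text` compares bytes, so for multi-byte text its percentage is only an approximation.

```php
<?php
// Hypothetical key: sum of Unicode code points (assumes mbstring, PHP 7.2+).
function candidateKey(string $text): int
{
    $sum = 0;
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $char) {
        $sum += mb_ord($char, 'UTF-8');
    }
    return $sum;
}

/**
 * Hypothetical two-stage duplicate check.
 * $stored is a list of rows shaped like ['key' => int, 'text' => string].
 * Returns the stored texts that pass both stages.
 */
function findLikelyDuplicates(string $newText, array $stored, int $tolerance, float $minPercent): array
{
    $newKey = candidateKey($newText);
    $matches = [];
    foreach ($stored as $row) {
        // Stage 1: cheap shortlist on the key. In a database this could be
        // an indexed range condition on the key column instead of a loop.
        if (abs($row['key'] - $newKey) > $tolerance) {
            continue;
        }
        // Stage 2: pairwise check, run only on the shortlisted rows.
        similar_text($newText, $row['text'], $percent);
        if ($percent >= $minPercent) {
            $matches[] = $row['text'];
        }
    }
    return $matches;
}

$stored = [];
foreach (['this is 1st string', 'this is first string', 'I go to school', 'caress'] as $text) {
    $stored[] = ['key' => candidateKey($text), 'text' => $text];
}

// 'scares' has the same key as 'caress', so stage 1 lets it through;
// stage 2 then decides whether the pair is similar enough.
print_r(findLikelyDuplicates('scares', $stored, 0, 90.0));
```

The key only narrows the search. Collisions such as anagrams reach stage 2 and are filtered there, and a pair of similar strings whose keys differ by more than the tolerance is never compared at all.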
