
Just playing around with Python's ngram library, I came across an issue related to string similarity. The ratio output was a bit confusing. See what I tried:

>>> ngram.NGram.compare('alexp','Alex Cho',N=1)*100
30.0
>>>
>>> ngram.NGram.compare('alexp','Alex Plutzer',N=1)*100
21.428571428571427
>>> ngram.NGram.compare('alexp','Alex Plutzer'.lower(),N=1)*100
41.66666666666667
>>> ngram.NGram.compare('alexp','Alex Cho'.lower(),N=1)*100
44.44444444444444
>>> ngram.NGram.compare('alexp','AlexCho'.lower(),N=1)*100
50.0
>>> ngram.NGram.compare('alexp','AlexPlutzer'.lower(),N=1)*100
45.45454545454545

The most similar should be the one containing alexp, i.e. Alex Plutzer, but the higher score is being assigned to the former, i.e. Alex Cho.
What can be done to get an appropriate result, where Alex Plutzer comes out with a higher score than the competing one?

Jaffer Wilson
  • Maybe this is due to the different lengths of the strings? Skimming over the docstring of the compare method I haven't discovered how the similarity is measured. – Quickbeam2k1 Aug 02 '17 at 07:56
  • Yeah may be. But I cannot truncate the values for comparison. So is there anything that can be done with n-grams so I get appropriate results – Jaffer Wilson Aug 02 '17 at 08:00
  • 1
    https://stackoverflow.com/questions/26037351/n-gram-character-based-similarity-measure – Alexander Aug 02 '17 at 08:02
  • @Alexander Cool... but are there any answers for Python too? Otherwise, I need to implement it in Python. My application design is complete in Python and I do not wish to include any module from other languages. But if there is no option left, then I will surely go with this. – Jaffer Wilson Aug 02 '17 at 08:10
  • have you tried N=2? – Quickbeam2k1 Aug 02 '17 at 10:36
  • @Quickbeam2k1 Yes I did. It works for some strings but not for all. – Jaffer Wilson Aug 02 '17 at 10:37
  • But I assume it did work in the specific case above. Since you seem to need to classify the data, maybe a machine learning approach (e.g. multiclass classification) will help, though this might be quite expensive in terms of computation. – Quickbeam2k1 Aug 02 '17 at 10:46

1 Answer


With a bit of domain knowledge (only 1-grams are involved) and some curve fitting against your numbers, I claim that the similarity of two strings S and T is computed as

similarity(S, T) = |ngrams(S) ∩ ngrams(T)| / |ngrams(S) ∪ ngrams(T)|

where ngrams gives the multiset of n-grams of a string, ∩ and ∪ denote multiset intersection and union, and the bars denote the number of elements in that multiset (repeated characters are counted with multiplicity).
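The similarity described above can be reproduced with a few lines of plain Python. This is a sketch of the formula, not the library's actual implementation, but it matches the N=1 numbers from the question:

```python
from collections import Counter

def ngrams(s, n=1):
    """Multiset of character n-grams of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def similarity(s, t, n=1):
    a, b = ngrams(s, n), ngrams(t, n)
    shared = sum((a & b).values())                      # multiset intersection
    total = sum(a.values()) + sum(b.values()) - shared  # multiset union
    return shared / total

print(similarity('alexp', 'alexcho') * 100)      # 50.0, as in the question
print(similarity('alexp', 'alexplutzer') * 100)  # 45.4545..., as in the question
```

Because 'alexplutzer' is longer, its extra characters inflate the union, which is exactly why the shorter 'alexcho' ends up with the higher score.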

So the results you obtain are correct: they are exactly what this formula yields. Maybe the Levenshtein distance suits your needs better.
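For reference, here is a minimal Levenshtein distance sketch (the standard dynamic-programming formulation; lower distance = more similar). Note that for these particular strings it also favors the shorter name, so it would need the same normalization work to help here:

```python
def levenshtein(s, t):
    # prev[j] holds the edit distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein('alexp', 'alex cho'))      # 4
print(levenshtein('alexp', 'alex plutzer'))  # 7
```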

Additionally, you may want to check the following stackoverflow thread, and whether nltk provides the similarity scores you need
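Before reaching for nltk, the standard library's difflib.SequenceMatcher also yields a ratio-style similarity score. Shown here only as an option to evaluate, not a guaranteed fix; for these strings it again scores the shorter name slightly higher:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # 2 * (matched characters) / (total characters in both strings)
    return SequenceMatcher(None, a, b).ratio()

print(ratio('alexp', 'alexcho'))      # ~0.667
print(ratio('alexp', 'alexplutzer'))  # 0.625
```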

Quickbeam2k1
  • I tried the `Levenshtein distance`. I also went through the `FuzzyWuzzy` library, tried the `Jaro` algorithm, etc., but could not get a solution. With n-grams, at least I achieved satisfactory results. – Jaffer Wilson Aug 02 '17 at 09:02
  • But I thought, n-gram wasn't satisfactory. Do you have a general definition of what strings you want to compare? maybe you need to perform normalization, like removal of blanks etc. – Quickbeam2k1 Aug 02 '17 at 09:07
  • Dear, I have tried everything that you are saying or could even think of. The result is good for one string and bad for others, so I need the best possible algorithm. So far n-grams gave me around 85% accuracy, but I need at least 90+% to say the result is good. – Jaffer Wilson Aug 02 '17 at 09:23
  • maybe, you could try word2vec, you might want to check this [thread](https://stackoverflow.com/questions/21979970/how-to-use-word2vec-to-calculate-the-similarity-distance-by-giving-2-words) – Quickbeam2k1 Aug 02 '17 at 10:01
  • sorry, but I don't think that this will work in my case scenario. But is a good option I can think of using and making certain vectors. – Jaffer Wilson Aug 02 '17 at 10:11
  • Yes they are. But my data is damn fuzzy. What I am doing is having a collection of distorted strings just like the `alexp` and have billions of names. I am trying to match those names with the string so that I will come to know what exactly the word match was in the application for word match. That's the concept. – Jaffer Wilson Aug 02 '17 at 10:37
  • Should your fuzzy string always be compared with names at the beginning? In that case you could just compute how many characters of your fuzzy string match a properly normalized name (i.e. lowercase, without whitespace, etc.). Could you provide some more examples of fuzzy strings, particularly to show the structure of the data? However, you may also just want to try to fit a machine learning model (e.g. a multiclass classifier). – Quickbeam2k1 Aug 02 '17 at 10:41
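The character-matching idea from the last comment could be sketched as follows (a hypothetical helper, assuming the fuzzy string is meant to match the start of a normalized name):

```python
def prefix_overlap(fuzzy, name):
    norm = "".join(name.lower().split())  # lowercase, drop whitespace
    f = fuzzy.lower()
    k = 0
    # count leading characters of the fuzzy string that match the name
    while k < min(len(f), len(norm)) and f[k] == norm[k]:
        k += 1
    return k / len(f)  # fraction of the fuzzy string matched

print(prefix_overlap('alexp', 'Alex Plutzer'))  # 1.0
print(prefix_overlap('alexp', 'Alex Cho'))      # 0.8
```

With this measure, Alex Plutzer finally outranks Alex Cho for the example in the question, since string length no longer penalizes the longer name.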