1

I have written code that compares two strings to find matching words. Now I'd like to find words that are merely close to each other. For example, "book" and "brook" are similar, whereas "book" and "luck" are not. How should I go about this?

I was thinking of splitting each word into characters and counting the frequency of each character. Right now a matched word gets the value 0 and anything else gets 2, but I'd like to expand that second case to do what I described above.

for i in range(out.shape[0]):  # every row: out.shape[0] is the row count, out.shape[1] the column count
    for word in refArray:  # for each word in the reference array
        # .ix was removed in pandas 1.0, so address the cell by label with .loc
        # 0 if the word occurs exactly once in the row label, otherwise 2
        if out.index[i].count(str(word)) == 1:
            out.loc[out.index[i], str(word)] = 0
        else:
            out.loc[out.index[i], str(word)] = 2
Brndn
  • Cosine similarity is one of the ways to implement it. You can also use the `difflib` library. – Sociopath Jul 16 '18 at 10:15
  • maybe add +2 to count if it is the same letter and same position and +1 if just same letter in word, otherwise +0 – ddor254 Jul 16 '18 at 10:18
  • What you need is cosine similarity between two strings. Check out an example implementation here - https://stackoverflow.com/questions/15173225/calculate-cosine-similarity-given-2-sentence-strings – Pruthvi Kumar Jul 16 '18 at 10:22
  • I tried your method Pruthvi and it works for the whole string. It gives the value, 0, when trying to compare individual words. I'm trying to rectify this atm. – Brndn Jul 16 '18 at 15:27
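
As a rough illustration of the `difflib` suggestion in the comments, here is a minimal sketch that scores two words with the standard library's `SequenceMatcher`. The helper names `similarity` and `is_close` and the 0.8 cutoff are assumptions made for this example, not something from the question; the cutoff would need tuning against real data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def is_close(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: call two words 'close' above an arbitrary cutoff."""
    return similarity(a, b) >= threshold


# "book" vs "brook" scores noticeably higher than "book" vs "luck"
print(similarity("book", "brook"))
print(similarity("book", "luck"))
print(is_close("book", "brook"), is_close("book", "luck"))
```

Because this works on individual words rather than whole strings, it avoids the problem of a sentence-level measure returning 0 for single-word comparisons.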

2 Answers

0

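You want to calculate the edit distance, for example the Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one word into the other. https://en.wikipedia.org/wiki/Edit_distance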

$ pip3 search edit | grep distance
edith (0.1.0a1)            - Edit-distanc implementation with edit-path retrieval
string-distance (1.0.0)    - Minimum Edit Distance
subdist (0.2.1)            - Substring edit distance
editdist (0.1)             - Calculate Levenshtein's edit distance
leven (1.0.4)              - Levenshtein edit distance library
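
If you would rather not install a package, the Levenshtein variant of edit distance is short enough to write by hand. The following is only a sketch of the usual dynamic-programming approach; the function name `levenshtein` is my own choice here, and the listed packages may differ in API and performance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("book", "brook"))  # 1: insert "r"
print(levenshtein("book", "luck"))   # 3: three substitutions
```

A small distance relative to the word length suggests the words are close, so in the question's loop the fixed value 2 could be replaced by this distance.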
Thomas Strub
-1

I ended up using nltk after browsing Google. I just need to compare simple words at this stage to get the basic functioning of my program. Will consider the more complex solutions later on. Appreciate the help.

import nltk
nltk.edit_distance("word1", "word2")

Source: https://datascience.stackexchange.com/a/12583/56244

Brndn
  • Why was I downvoted? Is there some kind of bureaucracy here? – Brndn Jul 16 '18 at 14:35
  • I'm going to diverge into phonetics eventually so this is for functional demonstration rather than structural longevity. – Brndn Jul 17 '18 at 14:43