How can I merge similar words into one?

Question

The similar_text gem can calculate the words's pairwise similarity. I want to merge words whose similarity is greater than 50% into one, and keep the longest one.

Original

[
  "iphone 6",
  "iphone 5c",
  "iphone 6",
  "macbook air",
  "macbook",
]

Expected
```
[
  "iphone 5c",
  "macbook air",
]
```

But I don't know how to implement the algorithm to filter the expected results efficiently.

What if the words are of the same length? – sawa Jul 25 '15 at 12:03 — sawa, Jul 25 '15 at 12:03

score 0 · Answer 1 · edited May 23 '17 at 10:24

This is not a trivial problem and also not 100% what you're looking for.

Especially how to handle transitive similarities: If a is similar to b and b similar to c, are a and c in the same group (even if they aren't similar to each other?)

Here is a piece of code where you can find all similar pairs in an array:

  def find_pairs(ar)
    ar.product(ar).reject{|l,r| l == r}.map(&:sort).uniq
    .map{|l,r| [[l,r],l.similar(r)]}
    .reject{|pair, similarity| similarity < 50.0}
    .map{|pair, _| pair}
  end

For an answer on how to find the groups in the matches see:

Finding All Connected Components of an Undirected Graph

score 0 · Answer 2 · answered Jul 25 '15 at 16:17

First of all - There is no efficient way to do this as you must calculate all piers which can take long times on long lists.

Having said that...

I am not familiar with this specific gem but I'm assuming it will give you some sort of distance between the two words (the smaller the better) or probability the words are the same (Higher results are better). Let's go with distance as changing the algorithm to probability is trivial.

This is just an algorithm description that you may find useful. It helped my in a similar case.

What I suggest is to put all the words in a 2 dimensional array as the headers of the rows and columns. If you have N words you need NxN matrix.

In each cell put the calculated distance between the words (the row and column headers).

You will get a matrix of all the possible distances. Remember that in this case we look for minimum distance between words.

So, for each row, look for the minimum cell (not the one with a zero value which is the distance of the word with itself).

If this min is bigger than some threshold than this word has no similar words. If not, look for all the words in this row with a distance up to the threshold (actually you can skip the previous stage and just do this search). All of the words you found belong to the same group. Look for the longest and use it in the new list you are building.

Also note the column indexes you found minimum distances in, and skip the rows with that indexes (so you will not add the same words to different groups).

How can I merge similar words into one?

2 Answers2