Clusters of words with Hamming distance of 1

Question

You are given a set of words, e.g.
{ ded, aaa, dec, aab, cab, def }
You need to add each word into a group. A word can be added to a group if:

It is the only word in the group

There is at least one other word already in the group that is at the most 1 edit distance away from the given word.

Your function must return the minimum possible number of such groups that can be formed by using all the given words.

Example, for the given input string, the groups would look like this:
{ aaa, aab, cab }, { ded, def, dec }
Explanation: distance(aaa, aab) is 1 so they belong in the same group. Also, distance(aab, cab) is also 1, so they also belong in the same group. But no words in the second group are at an edit distance of 1 from any other word in the first group, but are at an edit distance of 1 from at least one other word in their own group.

If we were given two more words in addition to the ones in the example, let's say cad, ced, then the answer would change to 1, because now distance(cab, cad) is 1, hence cad is in group 1 and distance(cad, ced) is 1, so ced is in group 1. Also, distance(ded, ced) is 1, so the second group will be "connected" with the first group, hence we will be left with only 1 group.

We're only interested in the number of groups, not the groups themselves.

Constraints: all words will have the same length, but that length is not fixed and could be large.

I could only come up with O(mn^2) where m is the length of any word and n is number of words. Did this using graph approach (Each word as a node and word with edit distance 1 as a neighbouring node).

Expected solution is O(mn).

it is not. But aaa to aab is 1 and aab to cab is 1. So they belong in the same cluster. — rhoadster91, Mar 15 '18 at 13:27
I don't understand why you're not grouping all the words in your example in just one cluster. Each word in such a cluster would have at least one 1-edit-distance partner, which seems to be what you're asking for. — גלעד ברקן, Mar 15 '18 at 13:44
there are no words in the second group who are at edit distance 1 from any word in the first group. The condition to be in a group is that they should have at most an edit distance of 1 from any other word in the group. — rhoadster91, Mar 15 '18 at 13:46
But `aaa` does not have "at most an edit distance of 1 from any other word in the group" it's in. — גלעד ברקן, Mar 15 '18 at 13:48
Where does your expected solution come from? And try to use technical terms. No graph-expert here, but this sounds somewhat like: graph-partitioning where each partition is a connected graph (where edges are binary-indicators of edit-distance <= 1). — sascha, Mar 15 '18 at 13:48
I have updated the question with slightly different wording. aaa and aab have an edit distance of 1, hence they should be in the same group. Again, aab and cab have an edit distance of 1, so they should be in the same group again. def does not have edit distance of 1 from any word in the first group so it is in a new group. def is at edit distance of 1 from ded so it is in the ded group. This continues till all words are in some group. — rhoadster91, Mar 15 '18 at 13:55
@גלעדברקן `aaa` does have at most edit distance of 1 from `aab` hence it is in the same group as `aab`. However, there is no word in the second group who has at most edit distance of 1 from any word in group 1. Hence they are in a separate group. — rhoadster91, Mar 15 '18 at 14:15
@sascha I was interviewing in a company where they asked me this question. I was able to answer in O(mn^2) because edit distance calculation takes m comparisons and had to compare each word with every other word in the list to create a graph. I managed to solve the question, but the interviewer said it can definitely be done in O(mn). I do not know the solution and was hoping someone can help me understand the approach. — rhoadster91, Mar 15 '18 at 14:27
I think the trick is that since you're only interested in whether the distance is 1 or a number greater than one (I assume it can't be zero), you don't actually have to calculate all the pairwise distances. — biziclop, Mar 15 '18 at 14:46
@biziclop as per the problem statement, distance can be 0. The criteria to put the word in a group is that distance should be at the most 1 from at least 1 other member of the group. also, I cannot figure out how I can use information from a previous compare operation. e.g. `distance(aaa, aab)` is 1. But how will this give me an indication of which character was swapped? A subsequent entry `cab` can be at distance of 1 from `aaa` OR `aab`. So wouldn't I need to calculate the distance from both anyway? — rhoadster91, Mar 15 '18 at 14:52
Zero distance means words aren't all unique. Are you sure this is allowed? — biziclop, Mar 15 '18 at 15:07
@biziclop yes, I explicitly asked this to the interviewer and he said it was allowed. — rhoadster91, Mar 15 '18 at 15:09
Okay, thanks. So, the most naive thing you can do is when calculating the distance, stop as soon as you reach the second differing character. This won't bring the asymptotic time down, but it's an example of how you can potentially exploit the fact that you don't need to know the distance, just if it's 0,1 or more. — biziclop, Mar 15 '18 at 15:11
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/166900/discussion-between-rhoadster91-and-biziclop). — rhoadster91, Mar 15 '18 at 15:15
I don't see O(mn) approach yet, but O(m^2*n) is possible using radix sort — MBo, Mar 15 '18 at 15:25
Quote from your comment: "The criteria to put the word in a group is that distance should be at the most 1 from at least 1 other member of the group." If you placed all the words in your example, `ded, aaa, dec, aab, cab, def` in one cluster, that criteria would hold for all words. — גלעד ברקן, Mar 15 '18 at 18:21
@גלעדברקן yes, but you cannot add an element to a group unless the rule is satisfied. When you are traversing the list serially, you will add ded in one group but you cannot add aaa to that group because the condition is not being satisfied. — rhoadster91, Mar 15 '18 at 18:35
@גלעדברקן anyway, sorry for confusing you. I have found the solution to the original question. thanks for the help. — rhoadster91, Mar 15 '18 at 18:35

rhoadster91 · Answer 1 · 2018-03-15T16:18:32.587

Found a solution which is an extension of the accepted solution here: Efficiently build a graph of words with given Hamming distance

Basically, the idea is to store the strings in a Set where lookup and delete are O(1) on average. Putting them in a set means we'd be overwriting strings with edit distance of 0 i.e. equal strings. But we don't care for them anyway, as they will always be in the same group.

Create an empty list of "start nodes" N.
Add next item from the set S in the list
Remove this string from the set S and call 4 for this string.
Generate all strings with Hamming distance 1 from string passed in parameter. For each such generated string, if it exists in the set, remove it from the set and call 4 for this string.
While set is not empty, repeat 2
Return size of "start nodes" list

Explanation of why this would work:

We traverse each node only once and remove it from the set. After we remove the string from the set, we also recursively remove any item in the set that was "adjacent" to it. But only the first node in the recursion is added to the start nodes list.

In our example, ded would get added to the node list and dec, def would get removed. Then aaa would get added to the node list and aab would be removed. While removing aab, recursively, cab would also be removed. The returned answer would be 2.

Time complexity:

O(mnC) where C is the size of the charset, m is the length of the string and n is the number of strings.

C substitutions made for each character in string m times. This is done once for each item in the string set.

Clusters of words with Hamming distance of 1

1 Answers1