Finding strings differing by one character in dictionary

Question

I have a dictionary whose keys are strings and whose values are the number of times each string occurs in a file. I want to find the strings that differ by one character and then remove the one with the lowest count from the dictionary.

From this:

dictionary = {'ATAA':5, 'GGGG':34, 'TTTT':34, 'AGAA':1}

To this:

new_dictionary = {'ATAA':5, 'GGGG':34, 'TTTT':34}

The dictionary is huge, so I am trying to find an efficient way to solve this. Any suggestions of how one could solve it would be super appreciated.

Take a look at this post:https://stackoverflow.com/questions/25216328/compare-strings-allowing-one-character-difference — GeorgesAA, May 09 '21 at 21:01
Have a look at the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) — Alexandre B., May 09 '21 at 21:17

Buddy Bob · Accepted Answer · 2021-05-09T22:29:01.633

This would be my homemade recipe. First, we gather all the keys that are made of exactly two distinct characters (len(set(k)) == 2). Then we sort this new dictionary by count. In your case we end up with {'AGAA': 1, 'ATAA': 5}, which means we can take the first key, AGAA, and delete it from the dictionary.

dic = {'ATAA':5, 'GGGG':34, 'TTTT':34, 'AGAA':1}
# Keep only the keys made of exactly two distinct characters
candidates = {k: v for k, v in dic.items() if len(set(k)) == 2}
# Delete the candidate with the lowest count
del dic[min(candidates, key=candidates.get)]
print(dic)

output

{'ATAA': 5, 'GGGG': 34, 'TTTT': 34}

Now, there is more. What if some keys have the same count? The code above will not work, because it only ever deletes one key. I spent the last couple of minutes baking up some new code.

I'll break it down.

from collections import defaultdict
#----------
#This will give us {'ATAA': 5, 'AGAA': 5}, we have located the different keys
dictionary = {'ATAA':5, 'GGGG':34, 'TTTT':34, 'AGAA':5}
lowest = {k: v for k, v in sorted({k:v for k,v in dictionary.items() if len(set(k)) == 2}.items(), key=lambda item: item[1])}
#----------
#This will give us ['ATAA', 'AGAA']. Groups the keys by count and takes the group with the lowest count.
grouped = defaultdict(list)
for key in lowest:
    grouped[lowest[key]].append(key)
simKeys = grouped[min(grouped)]
#----------
#This will delete every key tied for the lowest count, whether that is one key or many
dictionary = {k: v for k, v in dictionary.items() if k not in simKeys}
print(dictionary)
#----------
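
Note that the filter len(set(k)) == 2 looks at each key on its own; it does not compare keys with each other. A minimal sketch of a direct pairwise check follows. It is not the answer's method. It assumes that "differ by one character" means equal length with exactly one mismatched position (Hamming distance 1), and that on a tied count the second key of the pair is the one to drop. The helper name differs_by_one is hypothetical.

```python
from itertools import combinations

def differs_by_one(a, b):
    """Return True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

dictionary = {'ATAA': 5, 'GGGG': 34, 'TTTT': 34, 'AGAA': 1}

# Compare every pair of keys and remember the lower-count key of each close pair.
# Assumption: on a tie, the second key of the pair is dropped.
to_remove = set()
for a, b in combinations(dictionary, 2):
    if differs_by_one(a, b):
        to_remove.add(a if dictionary[a] < dictionary[b] else b)

new_dictionary = {k: v for k, v in dictionary.items() if k not in to_remove}
print(new_dictionary)
```

This compares all pairs, so the work grows quadratically with the number of keys, which may be too slow for a huge dictionary.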

Thanks BuddyBob! What if I have the following dictionary: dictionary = {'ATAA':53, 'GGGG':34, 'GCGG':3, 'AGAA':5}. Then I would like the following output: dictionary = {'ATAA': 53, 'GGGG': 34}. In the solution you made only one comparison is made. — Vibramat, May 10 '21 at 07:57
Why would it be that? `GCGG` is the lowest different key. My output is `{'ATAA': 53, 'GGGG': 34, 'AGAA': 5}` — Buddy Bob, May 10 '21 at 14:44
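
For the case raised in the comments, where several pairs each need their own comparison, one possible approach is to bucket the keys by "key with one position masked out". Two keys land in the same bucket only if they have the same length and agree everywhere except at the masked position. The sketch below is not from the thread and rests on assumptions: the function name drop_near_duplicates is hypothetical, every bucket keeps only its highest-count key (the first one found on a tie), and all decisions use the original counts in a single pass.

```python
from collections import defaultdict

def drop_near_duplicates(counts):
    """Return a copy of counts without keys that lose to a one-character neighbour."""
    # Bucket id: (masked position, text before it, text after it).
    buckets = defaultdict(list)
    for key in counts:
        for i in range(len(key)):
            buckets[(i, key[:i], key[i + 1:])].append(key)

    # In every bucket with several keys, keep the highest count and drop the rest.
    to_remove = set()
    for keys in buckets.values():
        if len(keys) > 1:
            best = max(keys, key=counts.get)
            to_remove.update(k for k in keys if k != best)

    return {k: v for k, v in counts.items() if k not in to_remove}

dictionary = {'ATAA': 53, 'GGGG': 34, 'GCGG': 3, 'AGAA': 5}
print(drop_near_duplicates(dictionary))
```

Each key is placed in one bucket per character position, so the work grows with the number of keys times the key length rather than with the number of key pairs.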
