Is there a way to drop a letter in a string if it repeats?
For example lets say that I have the string aaardvark
and I wanted to drop one of the beginning a
, how would I do this?
Is there a way to drop a letter in a string if it repeats?
For example lets say that I have the string aaardvark
and I wanted to drop one of the beginning a
, how would I do this?
If I understood your question correctly, you can do this using regular expressions:
import re
re.sub(r'(.)\1+', r'\1', 'aardvarrk')
This collapses all sequences of identical characters into one, giving you 'ardvark'
.
As for the implementation of your spell checker, I suggest "collapsing" all words that have repeating characters in sequence in your dictionary and keeping that in a dictionary (data structure), where the key is the collapsed word and the value is the original word (or possibly a set
of original words):
{
'aple': 'apple',
'acord': 'accord'
'halo': set(['hallo', 'halo'])
}
Now when you analyze your input, for each word:
Check if it exists in your list of correct words. If it does, ignore it. (eg: input is 'person'
. It's in the list of words. Nothing to do here).
If it doesn't, "collapse" it and see if:
'computerr'
becomes 'computer'
. Now you just replace it with the original word in your list).'aaapppleee'
become 'aple'
. Now you look up 'aple'
in your word list. It's not there. Now look in your dictionary for the key 'aple'
. If it is there. Replace it with its value, 'apple'
.)The only problem I see with this approach is two valid words possibly "collapsing" into the same "word." This means you'll have to use a set
as your value.
Say 'hallo'
and 'halo'
are both valid words and the user enters 'halloo'
. Now you'll have to decide which one to replace with. This can be done by calculating the Levenshtein distance between the input and the possible replacements.
Here is a solution that will allow you to iterate over all versions of the string with different combinations of repeated letters:
from itertools import product, groupby
# groups == ['aaaa', 'ppp', 'lll', 'ee']
groups = [''.join(g) for c, g in groupby('aaaappplllee')]
# lengths is an iterator that will return all combinations of string lengths to
# use for each group, starting with [4, 3, 3, 2] and ending with [1, 1, 1, 1]
lengths = product(*[range(x, 0, -1) for x in map(len, groups)])
# Using the lengths from the previous line, this is a generator that yields all
# combinations of the original string with duplicate letters removed
words = (''.join(groups[i][:v] for i, v in enumerate(x)) for x in lengths)
>>> for word in words:
... print word
...
aaaappplllee
aaaapppllle
aaaapppllee
aaaappplle
aaaappplee
aaaappple
...
apple
aplllee
apllle
apllee
aplle
aplee
aple
This is not the most efficient solution for finding the correct word, but it is consistent with the OP's original method for finding the match.
Here's a completely different approach using difflib from the standard library:
import difflib
words = open('/usr/share/dict/words').read().split()
difflib.get_close_matches('aaaappplllee', words, 3, 0.5)
['appalled', 'apple', 'appellate']
difflib.get_close_matches('aaardvarrk', words, 3, 0.5)
['aardvark', 'aardvarks', "aardvark's"]