Dropping a letter if it repeats

Question

Is there a way to drop a letter in a string if it repeats?

For example lets say that I have the string aaardvark and I wanted to drop one of the beginning a, how would I do this?

Do you want to drop the character if it's the same as the preceeding one, or if it was before anywhere in the string? If the latter, do you need the order to be preserved? — millimoose, Oct 17 '11 at 21:58
Check [this](http://stackoverflow.com/questions/7794208/python-remove-duplicate-words-in-string/7794257#7794257). I received a lot of flak for this answer, but its short and simple. — spicavigo, Oct 17 '11 at 21:58
please notice that I have 3 `a` and want to go down to 2 `a`. — locoboy, Oct 17 '11 at 22:02

999999 · Accepted Answer · 2011-10-17T22:32:39.483

3

If I understood your question correctly, you can do this using regular expressions:

import re
re.sub(r'(.)\1+', r'\1', 'aardvarrk')

This collapses all sequences of identical characters into one, giving you 'ardvark'.

As for the implementation of your spell checker, I suggest "collapsing" all words that have repeating characters in sequence in your dictionary and keeping that in a dictionary (data structure), where the key is the collapsed word and the value is the original word (or possibly a set of original words):

{
 'aple': 'apple',
 'acord': 'accord'
 'halo': set(['hallo', 'halo'])
}

Now when you analyze your input, for each word:

Check if it exists in your list of correct words. If it does, ignore it. (eg: input is 'person'. It's in the list of words. Nothing to do here).
If it doesn't, "collapse" it and see if:
1. It exists in your word list. If it does, replace it. (eg.: 'computerr' becomes 'computer'. Now you just replace it with the original word in your list).
2. A key exists in your dictionary. If it does, replace it with the word associated with that key. (eg: 'aaapppleee' become 'aple'. Now you look up 'aple' in your word list. It's not there. Now look in your dictionary for the key 'aple'. If it is there. Replace it with its value, 'apple'.)

The only problem I see with this approach is two valid words possibly "collapsing" into the same "word." This means you'll have to use a set as your value.

Say 'hallo' and 'halo' are both valid words and the user enters 'halloo'. Now you'll have to decide which one to replace with. This can be done by calculating the Levenshtein distance between the input and the possible replacements.

edited Oct 17 '11 at 22:32

answered Oct 17 '11 at 21:58

999999

1,873
3
14
20

please notice that i have 3 `a` and want to go down incrementally to 2 `a`, then check it, then 1 `a`, then check it. – locoboy Oct 17 '11 at 22:03
so what i'm ultimately trying to do is build a spell checker using a trie structure. if I have the following input `aaaappplllee`, i should have the corrected spelling `apple`. – locoboy Oct 17 '11 at 22:10
1

@cfarm54 The simplest solution is to collapse all sequences of identical characters into one when taking the input **and** in your word dictionary. Then try to find a match. You will never get a false negative, while remaining fairly efficient. – 999999 Oct 17 '11 at 22:12
all sequences of identical characters EXCEPT `\n` – John Machin Oct 17 '11 at 22:12
@JohnMachin I am working on the assumption that the OP is working with words, not sentences. – 999999 Oct 17 '11 at 22:13
@cfarm54: so you want to change a correct spelling like `accord` to the incorrect `acord`?? – John Machin Oct 17 '11 at 22:14
@999999 but what if i need to print the output of the corrected string? – locoboy Oct 17 '11 at 22:14
@JohnMachin no. accord should return accord and aaaapppplllleee should return apple. – locoboy Oct 17 '11 at 22:15
@999999: Your assumptions are irrelevant. Your sentence "It collapses ..." is incorrect. – John Machin Oct 17 '11 at 22:15
@JohnMachin I think what I want to do is try all versions of repeated letters against the database. so for example, i would call `search(aaardvark)` then if that was wrong i would call `search(aardvark)` then that would work. Similarly if I had aaapple then I would call all permutations of the word `aaapp` + `le`. get it? – locoboy Oct 17 '11 at 22:21
@cfarm54 I updated my answer with a suggested implementation. The way you are trying to go will result in a lot of unnecessary work. A word like `aaaaapllleeeee` would require a lot of tries. – 999999 Oct 17 '11 at 22:21
@999999 are there any corner cases with your proposed dictionary style? I can't think of any off the top of my head, but my assumption is there might be a problem with that. – locoboy Oct 17 '11 at 22:24
@cfarm54 There is one thing I can think of. I added it to my answer. – 999999 Oct 17 '11 at 22:31
@999999 given that you have a list of words, how would you create this dictionary in code? – locoboy Oct 17 '11 at 23:55
@999999 also, this wouldn't solve the ability to replace chars with others. for example, if someone typed `hrllo` we would need to implement something that would refer that to `hello` – locoboy Oct 18 '11 at 00:00
@cfarm54 I answered the question as you presented it. If you want to write a full blown spellchecker, that will take some work. But see this: http://norvig.com/spell-correct.html – 999999 Oct 18 '11 at 00:19

Andrew Clark · Answer 2 · 2011-10-17T22:49:47.417

Here is a solution that will allow you to iterate over all versions of the string with different combinations of repeated letters:

from itertools import product, groupby

# groups == ['aaaa', 'ppp', 'lll', 'ee']
groups = [''.join(g) for c, g in groupby('aaaappplllee')]

# lengths is an iterator that will return all combinations of string lengths to  
# use for each group, starting with [4, 3, 3, 2] and ending with [1, 1, 1, 1]
lengths = product(*[range(x, 0, -1) for x in map(len, groups)])

# Using the lengths from the previous line, this is a generator that yields all
# combinations of the original string with duplicate letters removed
words = (''.join(groups[i][:v] for i, v in enumerate(x)) for x in lengths)

>>> for word in words:
...   print word
... 
aaaappplllee
aaaapppllle
aaaapppllee
aaaappplle
aaaappplee
aaaappple
...
apple
aplllee
apllle
apllee
aplle
aplee
aple

This is not the most efficient solution for finding the correct word, but it is consistent with the OP's original method for finding the match.

score 0 · Answer 3 · answered Oct 17 '11 at 22:32

0

Here's a completely different approach using difflib from the standard library:

import difflib

words = open('/usr/share/dict/words').read().split()

difflib.get_close_matches('aaaappplllee', words, 3, 0.5)
['appalled', 'apple', 'appellate']

difflib.get_close_matches('aaardvarrk', words, 3, 0.5)
['aardvark', 'aardvarks', "aardvark's"]

answered Oct 17 '11 at 22:32

ekhumoro

115,249
20
229
336

do you know how difflib works when you call `get_close_matches`? – locoboy Oct 18 '11 at 00:32
It uses a [difflib.SequenceMatcher](http://docs.python.org/library/difflib.html#difflib.SequenceMatcher) to create a list of candidate matches, and a [priority queue](http://docs.python.org/library/heapq.html#module-heapq) to sort them in order of similarity. – ekhumoro Oct 18 '11 at 01:34

Dropping a letter if it repeats

3 Answers3