How can I 'normalise'
word = 'yeeeessssssss'
to
word = 'yes'
It's impossible to answer your question without more information. As you've stated it, you want to remove consecutive duplicates from an iterable. You can do that with itertools.groupby:
>>> from itertools import groupby
>>> "".join(c for c, _ in groupby("yeeessssss"))
'yes'
Of course, that will collapse every run of repeated letters, including the legitimate ones:
>>> dedupe = lambda s: "".join(c for c, _ in groupby(s))
>>> dedupe("hello")
'helo'
>>> dedupe("Mississippi")
'Misisipi'
I think your question is probably much more difficult: namely, how to normalise words which might have duplicate letters into actual English words. This is basically impossible to do precisely (what would beeeeeee or feeeed become?) but, with a lot of effort, you could probably approximate it with any of various heuristics.
One simple heuristic would be to check whether the word is in a dictionary and, if it is not, remove duplicate letters one at a time until it is. This will be very inefficient, but might work.
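A minimal sketch of that dictionary heuristic, assuming a tiny hand-written word set (WORDS is a hypothetical stand-in for a real word list, and normalise is just an illustrative name). It deletes one letter at a time from runs of repeated letters, breadth-first, so the first hit is the dictionary word that needs the fewest deletions:

```python
from collections import deque
from itertools import groupby

# Hypothetical stand-in for a real dictionary; swap in your own word list.
WORDS = {"yes", "feed", "fed", "bee", "be", "hello"}

def normalise(word, words=WORDS):
    """Return the first dictionary word reached by deleting repeated letters, or None."""
    queue = deque([word])
    seen = {word}
    while queue:
        current = queue.popleft()
        if current in words:
            return current
        # Delete one character from each run of repeated letters in turn.
        pos = 0
        for _, run in groupby(current):
            length = len(list(run))
            if length > 1:
                shorter = current[:pos] + current[pos + 1:]
                if shorter not in seen:
                    seen.add(shorter)
                    queue.append(shorter)
            pos += length
    return None

print(normalise("yeeeessssssss"))  # yes
print(normalise("feeeed"))         # feed (reached before "fed")
```

Note that the ambiguity is settled arbitrarily: "feeeed" becomes "feed" only because it is found first, not because it is more likely than "fed".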
Another way would be to use a natural-language library to convert the word to some "normal form". This might be by how it sounds, how it is spelled, or something else. You could then find the closest word to that normal form and use it to give your de-duplicated word.
Yet another way would be to define some sort of "modification distance" between strings (usually called edit distance or Levenshtein distance), where you assign a fixed cost to each of the operations "delete a character", "insert a character", and "modify a character". You could then compute the closest word to the input under this metric. This is a well-studied problem because it is used in bioinformatics, and there is an elegant dynamic programming approach to it. Unfortunately, it's also quite challenging to work out (a related question was a several-week coursework project in my undergraduate degree).
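For illustration, here is a rough sketch of the dynamic programming approach, assuming a cost of 1 for every operation (plain Levenshtein distance). The names edit_distance and closest_word and the candidate list are made up for this example:

```python
def edit_distance(a, b):
    """Levenshtein distance: unit cost for each insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the letters match
            ))
        prev = curr
    return prev[-1]

def closest_word(word, words):
    """Return the candidate with the smallest edit distance to word."""
    return min(words, key=lambda w: edit_distance(word, w))

print(edit_distance("yeeeessssssss", "yes"))                 # 10
print(closest_word("yeeeessssssss", ["yes", "no", "maybe"]))  # yes
```

With unit costs, every stretched letter counts as a full deletion, so long runs push the input far away from every candidate. A cheaper cost for deleting a repeated letter would probably suit this problem better, but that is a tuning decision this sketch does not make.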
tl;dr
Just removing duplicates is easy. Finding the best approximation as an English word is Very Hard.
If by normalizing you mean removing repeated characters, this should work:
import re
re.sub(r'(\w)\1+', r'\1', 'yeeeesssss')  # 'yes'
This seems similar to what you'd need to do using a spell checker.
One often-used solution is to use a Soundex function to reduce the word to "what it sounds like" and then compare that against a dictionary of known valid words. I don't think it would be foolproof, but it's an idea that may start you off in the right direction.
http://en.wikipedia.org/wiki/Soundex
Soundex isn't the only option. There are also Metaphone and several other similar algorithms that might work.
There's a previous question about Soundex with Python here: Soundex algorithm in Python (homework help request)
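As a rough illustration of the Soundex idea, here is a minimal sketch of the classic American Soundex rules (keep the first letter, code the remaining consonants as digits, collapse adjacent equal codes, pad to four characters). The helper sounds_like and the sample word list are assumptions made up for this example:

```python
# Letter-to-digit table used by American Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(word):
    """Return the four-character Soundex code of word, or '' if it has no letters."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (result + "000")[:4]

def sounds_like(word, words):
    """Return the candidates that share a Soundex code with word."""
    key = soundex(word)
    return [w for w in words if soundex(w) == key]

print(soundex("yeeeessssssss"))                                   # Y200
print(sounds_like("yeeeessssssss", ["yes", "no", "yaks", "use"]))  # ['yes', 'yaks']
```

The second result shows why this is not foolproof: "yaks" gets the same code as "yes", so some further ranking step is still needed.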
The hardest part is probably finding a good dictionary, but I've had luck with this search: http://www.bing.com/search?q=download+word+list&qs=n&form=QBRE&pq=download+word+list&sc=8-18&sp=-1&sk=
No matter what you do, it's not going to be perfect. As pointed out in some of the comments, there are a lot of complexities to deal with in the English language (and in any language, for that matter). Differentiating between "too" and "to", for example, depends on the context. Microsoft and others have had teams of developers working on spell-checkers for years, and spell-checkers still don't get it right 100% of the time and still require human intervention. I think you'll face the same issue with word normalization.
Use the enchant module to check whether the returned word is an English word or not:
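import itertools

import enchant  # third-party package: pyenchant

d_us = enchant.Dict("en_US")
d_gb = enchant.Dict("en_GB")  # British English is tagged "en_GB", not "en_UK"
words = []
teks = 'yeeeessssssss'
# Try every ordering of the distinct letters and keep the ones that are real words.
for x in itertools.permutations(set(teks)):
    candidate = ''.join(x)
    if d_us.check(candidate) or d_gb.check(candidate):
        words.append(candidate)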