
How can I 'normalise'

word = 'yeeeessssssss'

to

word = 'yes'
user1397595
  • Is this about a specific case? Or a generalization for other words that could also map to 'yes'? – Beau Grantham Jun 11 '12 at 14:38
  • well if its just the case of yeeesss can't you just remove the repeated characters eee and sss? – pyCthon Jun 11 '12 at 14:38
  • A few questions for clarity: Does the word in question have to be yes, and is the "incorrect" input always repeated letters? – Dave Jun 11 '12 at 14:40
  • Unless you only need to "normalize" "yes", you would need full English dictionary, otherwise my comment would look like "You would ned ful English dictionary, otherwise my coment would lok like..." ;D – Griwes Jun 11 '12 at 14:41
  • @BeauGrantham Well, I mean it's just a generalization. I want to normalize other words too, like "nammme" to be "name" and such. – user1397595 Jun 11 '12 at 14:45
  • It is a terrible problem. Apart from a dictionary + a measure for word-to-word distances, you could use n-grams to select the "most probable" candidates in a given context. (I expect: N>=3) BTW: as a first step you could use letter n-grams. – wildplasser Jun 11 '12 at 14:48
  • How do you want to handle "too"? Should it become "to"? What about "fussing/fusing", "coppers/coopers/copers", "stripped/striped", "sloops/slops", "tapping/taping", "taxiing/taxing", "wagged/waged", etc.? If you want to only normalize in case of a lack of ambiguity, then as said above, you're going to need an entire English dictionary. – DSM Jun 11 '12 at 14:49
  • Wouldn't this be "unstuttering" rather than "normalization"? – avakar Jun 11 '12 at 14:53

4 Answers


It's impossible to answer your question without more information. As you've stated it, you want to remove consecutive duplicates from an iterable. You can do that with itertools.groupby:

>>> "".join(c for c, _ in groupby("yeeessssss"))
'yes'

Of course, that will collapse every run of repeated letters, wanted or not:

>>> dedupe = lambda s: "".join(c for c, _ in groupby(s))
>>> dedupe("hello")
'helo'
>>> dedupe("Mississippi")
'Misisipi'

I think your question is probably much more difficult; namely, how to normalise words which might have duplicate letters into actual English words. This is basically impossible to do precisely -- what would beeeeeee or feeeed become? -- but, with a lot of effort, you could probably approximate it by any of various heuristics.

One simple heuristic would be to see if the word is in a dictionary and, if not, remove duplicate letters one at a time until it is. This will be very inefficient, but it might work; see the sketch below.
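A rough sketch of that idea, assuming some word list is available (the tiny DICTIONARY set here is only a stand-in for a real one):

from collections import deque
from itertools import groupby

DICTIONARY = {"yes", "name", "feed", "bee"}   # stand-in for a real word list

def shorter_variants(word):
    # Yield every word obtained by shortening one run of repeated letters by one.
    runs = [(ch, len(list(grp))) for ch, grp in groupby(word)]
    for i, (ch, n) in enumerate(runs):
        if n > 1:
            yield "".join(c * (k - 1 if j == i else k) for j, (c, k) in enumerate(runs))

def normalise(word):
    # Breadth-first search: drop repeated letters until a dictionary word appears.
    seen, queue = {word}, deque([word])
    while queue:
        current = queue.popleft()
        if current in DICTIONARY:
            return current
        for variant in shorter_variants(current):
            if variant not in seen:
                seen.add(variant)
                queue.append(variant)
    return word   # nothing matched; give the input back unchanged

print(normalise("yeeeessssssss"))  # -> 'yes'
print(normalise("nammme"))         # -> 'name'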

Another way would be to use a natural-language library to convert the word to some "normal form". This might be by how it sounds, how it is spelled, or something else. You could then find the closest word to that normal form and use it to give your de-duplicated word.
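If "closest by how it is spelled" is enough, the standard library's difflib gives a quick approximation of this without any external dependency (the vocabulary below is a stand-in, and the lowered cutoff is needed because heavily stretched words score poorly):

import difflib

VOCABULARY = ["yes", "no", "name", "feed", "bee"]   # stand-in word list

# Spelling similarity via SequenceMatcher; the default cutoff is 0.6.
print(difflib.get_close_matches("nammme", VOCABULARY))
# -> ['name']

# A heavily stretched word shares proportionally few characters, so relax the cutoff.
print(difflib.get_close_matches("yeeeessssssss", VOCABULARY, n=1, cutoff=0.3))
# -> ['yes']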

Yet another way would be to define some sort of "modification distance" between strings, where you assign a fixed cost to each of the operations "delete a character", "insert a character", and "modify a character". You could then compute the closest word to the input under this metric. This is a well-studied problem because it is used in bioinformatics, and there is an elegant dynamic programming approach to it. Unfortunately, it's also really quite challenging to work out (a related question was a several-week coursework project in my undergraduate degree).
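For illustration, a minimal dynamic-programming sketch of that edit distance (unit cost per insertion, deletion, or substitution), followed by a nearest-word lookup against a stand-in word list:

def edit_distance(a, b):
    # Levenshtein distance: the minimum number of single-character insertions,
    # deletions and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))        # distances from "" to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]

DICTIONARY = ["yes", "no", "name", "maybe"]   # stand-in word list
word = "yeeeessssssss"
print(min(DICTIONARY, key=lambda w: edit_distance(word, w)))  # -> 'yes'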


tl;dr

Just removing duplicates is easy. Finding the best approximation as an English word is Very Hard.

Katriel

If by normalizing you mean removing repeated characters, this should work:

import re
re.sub(r'(\w)\1+', r'\1', 'yeeeesssss')  # -> 'yes'
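For clarity, the same pattern wrapped as a small helper (the function name is just for illustration): (\w)\1+ matches a word character followed by one or more repeats of itself, and the replacement keeps only the single captured character, so every run collapses to one letter. Like the groupby approach above, it also collapses legitimate double letters.

import re

def collapse_runs(text):
    # Collapse any run of the same word character down to a single occurrence.
    return re.sub(r'(\w)\1+', r'\1', text)

print(collapse_runs('yeeeesssss'))  # -> 'yes'
print(collapse_runs('hello'))       # -> 'helo' (legitimate doubles are lost too)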
user278064

This seems similar to what you'd need to do using a spell checker.

One often-used solution is to use a Soundex function to reduce the word to "what it sounds like" and then compare it against a dictionary of known valid words. I don't think it would be fool-proof, but it's an idea that may start you off in the right direction.

http://en.wikipedia.org/wiki/Soundex
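As a rough illustration of the idea, here is a deliberately simplified Soundex (real Soundex has extra rules for h, w, and vowel separation) together with a lookup against a tiny stand-in word list:

def soundex(word):
    # Simplified Soundex: keep the first letter, map the remaining letters to
    # digit codes, skip adjacent duplicate codes, and pad/truncate to four characters.
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
             "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4",
             "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

WORDS = ["yes", "yet", "use", "nose"]   # stand-in for a real word list

def sounds_like(word, vocabulary=WORDS):
    # Return the known words whose (simplified) Soundex code matches the input's.
    target = soundex(word)
    return [w for w in vocabulary if soundex(w) == target]

print(soundex("yes"), soundex("yeeeessssssss"))  # both 'Y200'
print(sounds_like("yeeeessssssss"))              # -> ['yes']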

Soundex isn't the only option. There are also Metaphone and several other similar algorithms that might work.

There's a previous question about Soundex with Python here: Soundex algorithm in Python (homework help request)

The hardest part is probably finding a good dictionary, but I've had luck with this search: http://www.bing.com/search?q=download+word+list&qs=n&form=QBRE&pq=download+word+list&sc=8-18&sp=-1&sk=

No matter what you do, it's not going to be perfect. As pointed out in some of the comments, there are a lot of complexities to deal with in the English language (and any language, for that matter). Differentiating between "too" and "to", for example, depends on the context. Microsoft and others have put years of development and whole teams of developers into spell-checkers, which still can't get it right 100% of the time and still require human intervention. I think you'll face the same issue with word normalization.

David

Use the enchant module to check whether each candidate word is an English word or not (note that, because this permutes only the distinct letters, it can only recover target words that use each distinct letter of the input exactly once):

import enchant, itertools

d_us = enchant.Dict("en_US")
d_gb = enchant.Dict("en_GB")  # British English; "en_UK" is not a standard dictionary tag

teks = 'yeeeessssssss'
words = []
# Try every ordering of the distinct letters and keep the ones that are real words.
for x in itertools.permutations(set(teks)):
    candidate = ''.join(x)
    if d_us.check(candidate) or d_gb.check(candidate):
        words.append(candidate)

print(words)  # -> ['yes']
Ashwini Chaudhary