String preprocessing

Question

I'm dealing with a list of strings that may contain extra repeated letters compared with their original spelling, for example:

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

I want to pre-process these strings so that they are spelt correctly, producing a new list:

cleaned_words = ['why', 'hey', 'alright', 'cool', 'monday']

The length of the run of duplicated letters can vary; however, cool should obviously keep its spelling.

I'm unaware of any Python libraries that do this, and I'd prefer to avoid hard-coding it.

I've tried this: http://norvig.com/spell-correct.html, but the more words you put in the text file, the more likely it seems to suggest an incorrect spelling, so it never actually gets it right, even for input that has no additional letters. For example, eel becomes teel...

Thanks in advance.

Since the task is very language dependent, python by itself cannot do it for you. Try looking up some spelling correction packages for example https://pypi.python.org/pypi/autocorrect/0.1.0 — javad, Feb 05 '16 at 14:09
Please take a look at this post: http://stackoverflow.com/questions/4500752/python-check-whether-a-word-is-spelled-correctly. I suggest: 1) check each word for spelling. 2) If it is not right, use a loop to try removing a duplicated letter until the spelling is right. — Quinn, Feb 05 '16 at 14:10
I think that you won't get any real answer, unless you provide some code that you have written, or any reasoning - algorithms/papers/links used/thought/you have in mind. — Markon, Feb 05 '16 at 14:14
Most UNIXes should have a list of words in `/usr/share/dict/words`. Use it, if you need it — Andrea Corbellini, Feb 05 '16 at 15:38

score 2 · Answer 1 · answered Feb 05 '16 at 14:18

If it's only trailing repeated letters you want to strip, then the regular expression module re might help:

>>> import re
>>> re.sub(r'(.)\1+$', r'\1', 'cool')
'cool'
>>> re.sub(r'(.)\1+$', r'\1', 'coolllll')
'cool'

(It leaves 'cool' untouched.)

For leading repeated characters the correct substitution would be:

>>> re.sub(r'^(.)\1+', r'\1', 'mmmmonday')
'monday'

Of course this fails for words that legitimately start or end with repeated letters ...
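
To see both substitutions working together, here is a minimal sketch that applies them in turn to the list from the question. The helper name `strip_edge_repeats` is only an illustrative choice, and the last two calls show the failure mode mentioned above.

```python
import re

def strip_edge_repeats(word):
    # Hypothetical helper: collapse a run of repeated letters at the start...
    word = re.sub(r'^(.)\1+', r'\1', word)
    # ...and then a run of repeated letters at the end.
    return re.sub(r'(.)\1+$', r'\1', word)

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
print([strip_edge_repeats(w) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday']

# Words that legitimately start or end with a doubled letter get mangled:
print(strip_edge_repeats('eel'))  # 'el'
print(strip_edge_repeats('too'))  # 'to'
```

Repeated letters in the middle of a word are left alone, which is why 'cool' survives.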

Peter · Accepted Answer · 2016-02-12T15:14:54.877

1

If you were to download a text file of all English words to check against, this is another way that could work.

I've not tested it but you get the idea. It iterates through the letters, and if the current letter matches the previous one, it removes that letter from the word. If a run of repeated letters has been narrowed down to one and there is still no valid word, it resets the word to its original form and continues to the next run of duplicate characters.

import urllib.request

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
url = 'https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt'
with urllib.request.urlopen(url) as response:
    # splitlines() copes with both '\n' and '\r\n' line endings
    lines = response.read().decode('utf-8').splitlines()
word_list = set(line.strip().lower() for line in lines)

found_words = []
for word in (w.lower() for w in words):

    # Keep the word as it is if it already exists (e.g. 'cool')
    if word in word_list:
        found_words.append(word)
        continue

    i = 1
    current_word = word
    while i < len(current_word):

        # Duplicate character: drop it and test the shortened word
        if current_word[i] == current_word[i - 1]:
            current_word = current_word[:i] + current_word[i + 1:]
            if current_word in word_list:
                found_words.append(current_word)
                break

        # No more duplicates at this position: reset the word and move on
        else:
            current_word = word
            i += 1

print(found_words)
# Expected, provided the word list contains these words:
# ['why', 'hey', 'alright', 'cool', 'monday']
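
A variation on the same idea, offered as a sketch rather than a tested solution: instead of shrinking one run at a time, split the word into runs of identical letters and try each long run as either two letters or one, checking every combination against the dictionary. The names `candidates`, `clean` and `vocabulary` are made up for illustration, and the tiny set below only stands in for a real word list. It assumes a valid spelling never needs more than two identical letters in a row.

```python
import re
from itertools import product

def candidates(word):
    # Split into runs of identical letters, e.g. 'coool' -> ['c', 'ooo', 'l']
    runs = [m.group(0) for m in re.finditer(r'(.)\1*', word)]
    # A single letter stays as it is; a longer run is tried doubled, then single
    options = [[run] if len(run) == 1 else [run[0] * 2, run[0]] for run in runs]
    return [''.join(parts) for parts in product(*options)]

def clean(word, vocabulary):
    word = word.lower()
    # An already valid word is kept, so 'cool' never turns into 'col'
    if word in vocabulary:
        return word
    for candidate in candidates(word):
        if candidate in vocabulary:
            return candidate
    # Nothing matched: return the input unchanged
    return word

# Stand-in vocabulary (an assumption); in practice, load a full word list
vocabulary = {'why', 'hey', 'alright', 'cool', 'col', 'monday'}
words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
print([clean(w, vocabulary) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday']
```

Because the doubled form is tried first, 'coooool' resolves to 'cool' rather than 'col' here. That ordering is a design choice and will not suit every word.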

edited Feb 12 '16 at 15:14

answered Feb 05 '16 at 14:25


Upvote. Like the idea. The current output of this is `['whyyy', 'heyy', 'alrighttt', 'cool', 'mmmmonday']` so it's removing some end characters but not all. Any idea why? – user47467 Feb 05 '16 at 14:38
Ah sorry, I was running through each character once as if the word was staying the same size, but it's not so I've managed to fix it. For the record, you need to check `word_list` is correct too, I had to do `f.read().split('\r\n')` with it being a text file with each word on a new line. – Peter Feb 06 '16 at 00:00
I get a string index out of range error for some reason, on the `last_char = current_word[i]` line. – user47467 Feb 11 '16 at 17:51
Now I'm just guessing haha, but this might work. If it still doesn't work, send over the word list you're using and I'll try to figure it out :) I also just noticed 'cool' didn't work properly since there is also 'col', so I added in a bit to check for the original word too :P – Peter Feb 11 '16 at 21:20
Now it returns an empty list? :P I'm unsure how to give you the text file, but these are the words that I use: https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt – user47467 Feb 12 '16 at 10:25
I'll update the answer to actually use that link. My text file broke lines with `\r\n` but that one seems to use just `\n`, so that might be the problem :P – Peter Feb 12 '16 at 15:12

score 0 · Answer 3 · answered Feb 05 '16 at 16:47

Well, a crude way:

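words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

res = []
for word in words:
    # Strip repeated trailing letters; the length check avoids an IndexError on short words
    while len(word) > 1 and word[-2] == word[-1]:
        word = word[:-1]
    # Strip repeated leading letters
    while len(word) > 1 and word[0] == word[1]:
        word = word[1:]
    res.append(word)
print(res)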

Result: ['why', 'hey', 'alright', 'cool', 'monday']
