
I have a list of sentences, and my aim is to replace every occurrence of abbreviated prepositions such as "opp", "nr", "off", "abv", "behnd" with their correct spellings ("opposite", "near", "above", "behind", and so on). The Soundex code of each abbreviation is the same as that of the full word, so I need to iterate over this list word by word and, if the Soundex code matches, replace the word with the correct spelling.

An example:

['Jack was standing nr the tree',
 'they were abv everything he planned',
 'Just stand opp the counter',
 'Go twrds the gas station']

So I need to replace the words nr, abv, opp and twrds with their full forms. The Soundex code of "towards" and "twrds" is the same, so "twrds" should be replaced.
I need to iterate over this list.
Here's the Soundex algorithm:

import string

allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)

def soundex(source):
    "convert string to Soundex equivalent"

    # Soundex requirements:
    # source string must be at least 1 character
    # and must consist entirely of letters
    if (not source) or (not source.isalpha()):
        return "0000"

    # Soundex algorithm:
    # 1. make first character uppercase
    # 2. translate all other characters to Soundex digits
    digits = source[0].upper() + source[1:].translate(charToSoundex)

    # 3. remove consecutive duplicates
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d

    # 4. remove all "9"s
    # 5. pad end with "0"s to 4 characters
    return (digits2.replace('9', '') + '000')[:4]

if __name__ == '__main__':
    import sys
    if sys.argv[1:]:
        print soundex(sys.argv[1])
    else:
        from timeit import Timer
        names = ('Woo', 'Pilgrim', 'Flingjingwaller')
        for name in names:
            statement = "soundex('%s')" % name
            t = Timer(statement, "from __main__ import soundex")
            print name.ljust(15), soundex(name), min(t.repeat())
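
For anyone on Python 3: the listing above is Python 2 (`string.uppercase`, `string.maketrans` and the `print` statement do not exist in Python 3). A minimal sketch of the same function for Python 3 might look like the following. The names `_LETTERS` and `_TABLE` and the extra `isascii()` guard are my own assumptions, not part of the original code.

```python
import string

# Same digit table as the original: one Soundex digit per letter a-z,
# with "9" as a placeholder for letters that are dropped later.
_LETTERS = string.ascii_uppercase + string.ascii_lowercase
_TABLE = str.maketrans(_LETTERS, "91239129922455912623919292" * 2)


def soundex(source):
    """Convert a string to its Soundex equivalent (Python 3 port)."""
    # The source must be non-empty and consist entirely of ASCII letters.
    # The isascii() check (Python 3.7+) is an added assumption: accented
    # letters pass isalpha() but have no entry in the table.
    if not source or not (source.isascii() and source.isalpha()):
        return "0000"

    # 1. uppercase the first character
    # 2. translate all other characters to Soundex digits
    digits = source[0].upper() + source[1:].translate(_TABLE)

    # 3. remove consecutive duplicates
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d

    # 4. remove all "9"s
    # 5. pad the end with "0"s to 4 characters
    return (digits2.replace("9", "") + "000")[:4]
```

With this port, `soundex("nr")` and `soundex("near")` both give `"N600"`.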

I am a newbie, so if there's another approach you could suggest, it would be appreciated. Thanks.

Hypothetical Ninja
  • Could you fix your indentation? – Bach Feb 07 '14 at 13:18
  • Fixed :). Also, should I create my own file containing the correct spellings? – Hypothetical Ninja Feb 07 '14 at 13:46
  • That's not fixed. From `def` to `return` it should be indented. – Bach Feb 07 '14 at 13:52
  • It is actually not clear to me what exactly you're asking... – Bach Feb 07 '14 at 14:03
  • Check it now. I need to iterate over those sentences, compute the Soundex code of each word and compare it with the Soundex codes of my reference prepositions. If it matches a reference, the word is replaced with the reference word. Example: the Soundex codes of "near" and "nr" are the same, so the algorithm should replace "nr" with "near". – Hypothetical Ninja Feb 08 '14 at 03:08
  • https://pypi.python.org/pypi/jellyfish/0.2.1 – Joel Cornett Feb 08 '14 at 03:16
  • Thanks Joel. How do I iterate over the words of each sentence in the list? Secondly, how is that different from iterating over a Series (different lines in a DataFrame)? I have always had these questions and got by through trial and error; I want to know how it is really done. – Hypothetical Ninja Feb 08 '14 at 03:29
  • Iterating with "for line in data: for word in line: print word" gives me one character per line. I added a comma to the print statement, but then I get words like "n e a r". How do I get a whole word? – Hypothetical Ninja Feb 08 '14 at 03:38
  • I got it: I used "for word in line.split():" and got whole words. The next step is to compute the Soundex code of each word and compare it with the Soundex codes of the preposition list. – Hypothetical Ninja Feb 08 '14 at 03:43
  • PHP has its own soundex() function. You should use that, for starters. – Mike Sherrill 'Cat Recall' Feb 08 '14 at 14:46
  • No, I am using Python. Starting off with an entirely new language won't make much sense. I've managed to get this working in Python; you can check the code below. Thanks for your suggestion. – Hypothetical Ninja Feb 09 '14 at 09:04
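
The approach described in the comments (split each sentence, compute the Soundex code of every word, compare it with the codes of a reference list) could be sketched roughly as below. This is Python 3, and `REFERENCE_WORDS`, `CODE_TO_WORD` and `expand` are hypothetical names chosen for illustration; the `soundex` function is a port of the one in the question.

```python
import string

_LETTERS = string.ascii_uppercase + string.ascii_lowercase
_TABLE = str.maketrans(_LETTERS, "91239129922455912623919292" * 2)


def soundex(source):
    """Python 3 port of the Soundex function from the question."""
    if not source or not (source.isascii() and source.isalpha()):
        return "0000"
    digits = source[0].upper() + source[1:].translate(_TABLE)
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    return (digits2.replace("9", "") + "000")[:4]


# Hypothetical reference list of correctly spelled prepositions.
REFERENCE_WORDS = ["near", "above", "opposite", "towards", "behind"]

# Map each Soundex code to its reference word, computed once up front.
CODE_TO_WORD = {soundex(word): word for word in REFERENCE_WORDS}


def expand(sentence):
    """Replace every word whose Soundex code matches a reference word."""
    words = []
    for word in sentence.split():
        # Words with no matching code are kept unchanged.
        words.append(CODE_TO_WORD.get(soundex(word), word))
    return " ".join(words)


sentences = ['Jack was standing nr the tree',
             'they were abv everything he planned',
             'Just stand opp the counter',
             'Go twrds the gas station']

fixed = [expand(sentence) for sentence in sentences]
```

One caveat with this design: any ordinary word that happens to share a code with a reference word (for example "nor", which also maps to the code of "near") would be replaced too, so the reference list probably needs to stay small, or the lookup needs a dictionary check in front of it.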

1 Answer


I would use the enchant module (PyEnchant) together with the soundex function from the question:

import enchant
d = enchant.Dict("en_US")

phrase = ['Jack was standing nr the tree' ,
'they were abv everything he planned' ,
'Just stand opp the counter' ,
'Go twrds the gas station']

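output = []
for section in phrase:
    words = []
    for word in section.split():
        if d.check(word):
            # correctly spelled: keep as is
            words.append(word)
        else:
            # take the first suggestion whose Soundex code matches
            for correct_word in d.suggest(word):
                if soundex(correct_word) == soundex(word):
                    words.append(correct_word)
                    break
            else:
                # no suggestion matched: keep the original word
                words.append(word)
    output.append(' '.join(words))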
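
Before relying on the `soundex(correct_word) == soundex(word)` test, it may be worth checking which abbreviations actually share a code with their full form. A small Python 3 sketch (the `PAIRS` list and the `matches` dict are hypothetical names for illustration; `soundex` is a port of the function from the question):

```python
import string

_LETTERS = string.ascii_uppercase + string.ascii_lowercase
_TABLE = str.maketrans(_LETTERS, "91239129922455912623919292" * 2)


def soundex(source):
    """Python 3 port of the Soundex function from the question."""
    if not source or not (source.isascii() and source.isalpha()):
        return "0000"
    digits = source[0].upper() + source[1:].translate(_TABLE)
    digits2 = digits[0]
    for d in digits[1:]:
        if digits2[-1] != d:
            digits2 += d
    return (digits2.replace("9", "") + "000")[:4]


# Hypothetical (abbreviation, full form) pairs taken from the question.
PAIRS = [("nr", "near"), ("abv", "above"), ("opp", "opposite"),
         ("twrds", "towards"), ("behnd", "behind")]

# True where the abbreviation and the full word share a Soundex code.
matches = {}
for short, full in PAIRS:
    matches[short] = soundex(short) == soundex(full)
    print(short.ljust(6), soundex(short), full.ljust(9), soundex(full))

# Prints:
# nr     N600 near      N600
# abv    A100 above     A110
# opp    O100 opposite  O123
# twrds  T632 towards   T632
# behnd  B530 behind    B530
```

So with this particular implementation, "abv" and "opp" do not share a code with "above" and "opposite": the vowel between "b" and "v" keeps both digits in "above", and "opposite" has consonants after the point where "opp" stops. Those two would likely need an explicit abbreviation-to-word mapping instead of a Soundex comparison.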
Hrabal