1

I managed to do that, but the case I'm struggling with is treating 'color' as equal to 'colour' (and likewise for all such words) and returning the count accordingly. To handle this, I wrote a dictionary of common words whose spelling differs between American and British English, but I'm pretty sure this isn't the right approach.

ukus=dict()
ukus={'COLOUR':'COLOR','CHEQUE':'CHECK',
'PROGRAMME':'PROGRAM','GREY':'GRAY',
'JEWELLERY':'JEWELRY','ALUMINIUM':'ALUMINUM',
'THEATER':'THEATRE','LICENSE':'LICENCE','ARMOUR':'ARMOR',
'ARTEFACT':'ARTIFACT','CENTRE':'CENTER',
'CYPHER':'CIPHER','DISC':'DISK','FIBRE':'FIBER',
'FULFILL':'FULFIL','METRE':'METER',
'SAVOURY':'SAVORY','TONNE':'TON','TYRE':'TIRE',
'COLOR':'COLOUR','CHECK':'CHEQUE',
'PROGRAM':'PROGRAMME','GRAY':'GREY',
'JEWELRY':'JEWELLERY','ALUMINUM':'ALUMINIUM',
'THEATRE':'THEATER','LICENCE':'LICENSE','ARMOR':'ARMOUR',
'ARTIFACT':'ARTEFACT','CENTER':'CENTRE',
'CIPHER':'CYPHER','DISK':'DISC','FIBER':'FIBRE',
'FULFIL':'FULFILL','METER':'METRE','SAVORY':'SAVOURY',
'TON':'TONNE','TIRE':'TYRE'}

This is the dictionary I wrote to check the values. As you can see, this degrades performance. PyEnchant isn't available for 64-bit Python. Could someone please help me out? Thank you in advance.

  • Are you needing to do a 2-way check, as in check whether the US or UK is supplied? Your problem statement isn't very clear as to what you need to return based on what you need to submit. – dblclik Sep 22 '16 at 13:29
  • How are you actually using that dictionary? Why does it contain the conversions going both ways? BTW, `ukus=dict()` creates an empty dictionary, but you then discard that dictionary & replace it with a new one. – PM 2Ring Sep 22 '16 at 13:35
  • I think your answer may lie within the NLTK package [http://www.nltk.org/]. Perhaps stemming and lemmatizing will help? But if not NLTK is rich with text manipulation and changing. [http://stackoverflow.com/questions/771918/how-do-i-do-word-stemming-or-lemmatization] – MattR Sep 22 '16 at 13:42
  • @dblclik I have to return the count of the words in the string. But sometimes the given string might have US version of the word and user might be checking for UK version of it. In those cases the count must include all versions of the word. – sleepcoffeedelight Sep 24 '16 at 07:56
  • @PM2Ring I made the conversions two way because the string might have words stored in either US or UK English and search can be in either way too. If I store it in only one way then I'm unable to access keys using values – sleepcoffeedelight Sep 24 '16 at 07:59
  • @MattR I'm very new to Python and did come across the nltk package but understanding it seemed to require even deeper understanding of the language. – sleepcoffeedelight Sep 24 '16 at 08:00

2 Answers

0

Step 1: Create a temporary copy of the string, then replace every word that appears as a value in your dict with its corresponding key:

>>> temp_string = str(my_string).upper()  # upper-case first: the dict keys and values are upper-case
>>> for k, v in ukus.items():
...     temp_string = temp_string.replace(" {} ".format(v), " {} ".format(k))  # <--surround by space " " to replace only words
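
Surrounding each word with spaces misses words at the start or end of the string and words next to punctuation. A minimal sketch of a whole-word alternative using a regular expression is below. It assumes a one-way UK-to-US map; `UK_TO_US` and `normalize_spelling` are hypothetical names, and the map holds only a few sample entries.

```python
import re

# Hypothetical one-way map (UK -> US); only a few sample entries for illustration.
UK_TO_US = {'COLOUR': 'COLOR', 'TYRE': 'TIRE', 'GREY': 'GRAY'}

# \b matches only at word boundaries, so "GREYHOUND" is left alone
# while "GREY" on its own (or next to punctuation) is replaced.
_PATTERN = re.compile(r'\b(' + '|'.join(map(re.escape, UK_TO_US)) + r')\b')

def normalize_spelling(text):
    """Return text upper-cased, with each UK spelling replaced by its US form."""
    return _PATTERN.sub(lambda m: UK_TO_US[m.group(0)], text.upper())
```

Because the map goes in one direction only, each word ends up in a single spelling and there is no risk of one replacement undoing another.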

Step 2: Now, to find the words in the string, first split it into a list of words and then use collections.Counter() to count each element in the list. Below is the sample code:

>>> from collections import Counter
>>> my_string = 'Hello World! Hello again. I am saying Hello one more time'
>>> count_dict = Counter(my_string.split())
# Value of count_dict:
# Counter({'Hello': 3, 'saying': 1, 'again.': 1, 'I': 1, 'am': 1, 'one': 1, 'World!': 1, 'time': 1, 'more': 1})
>>> count_dict['Hello']
3
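
Note that a plain `split()` keeps punctuation attached, which is why 'again.' and 'World!' appear as keys above. One possible way around that, sketched here under the assumption that only letters and apostrophes count as part of a word (`count_words` is a hypothetical helper name), is to extract the words with `re.findall` before counting:

```python
import re
from collections import Counter

def count_words(text):
    """Count words case-insensitively, ignoring punctuation and digits."""
    # [a-z']+ matches runs of letters and apostrophes in the lower-cased text
    return Counter(re.findall(r"[a-z']+", text.lower()))

counts = count_words('Hello World! Hello again. I am saying Hello one more time')
```

Here `counts['hello']` is 3 and the key for the fourth word is 'again' rather than 'again.'.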

Step 3: Since you want the count to be available under both "colour" and "color" in your dict, iterate over the dict again so that both spellings carry the count, with missing words set to 0:

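original = dict(count_dict)  # snapshot, so totals are computed from the raw counts
for k, v in ukus.items():
    # sum both spellings; words missing from the string contribute 0
    total = original.get(k, 0) + original.get(v, 0)
    count_dict[k] = count_dict[v] = total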
Moinuddin Quadri
  • This will count "colour" separately from "color", which is **not** what the OP wants. – PM 2Ring Sep 22 '16 at 13:33
  • That still won't work if you use the OP's `ukus` dict: it will translate "colour" to "color", but also does the reverse & translates "color" to "colour". Also, doing a simple `.replace` will do replacements inside words, so "retire" gets translated to "retyre". – PM 2Ring Sep 22 '16 at 13:43
  • Sorry for my poor understanding. Updated the answer again to satisfy that requirement as well – Moinuddin Quadri Sep 22 '16 at 13:51
  • Thank you very much! temp_string when I print still shows the original string but nothing is being replaced. Im only talking about the first step – sleepcoffeedelight Sep 24 '16 at 08:26
  • UPDATE: I don't need the count of all dictionary elements. Just those in the string. That is all – sleepcoffeedelight Sep 24 '16 at 10:54
0

Okay, I think I know enough from your comments to provide this as a solution. The function below lets you choose either UK or US replacement (it defaults to US, but you can of course flip that) and lets you optionally perform minor hygiene on the string, replacing punctuation with spaces.

import re

ukus={'COLOUR':'COLOR','CHEQUE':'CHECK',
'PROGRAMME':'PROGRAM','GREY':'GRAY',
'JEWELLERY':'JEWELRY','ALUMINIUM':'ALUMINUM',
'THEATER':'THEATRE','LICENSE':'LICENCE','ARMOUR':'ARMOR',
'ARTEFACT':'ARTIFACT','CENTRE':'CENTER',
'CYPHER':'CIPHER','DISC':'DISK','FIBRE':'FIBER',
'FULFILL':'FULFIL','METRE':'METER',
'SAVOURY':'SAVORY','TONNE':'TON','TYRE':'TIRE'}
usuk={'COLOR':'COLOUR','CHECK':'CHEQUE',
'PROGRAM':'PROGRAMME','GRAY':'GREY',
'JEWELRY':'JEWELLERY','ALUMINUM':'ALUMINIUM',
'THEATRE':'THEATER','LICENCE':'LICENSE','ARMOR':'ARMOUR',
'ARTIFACT':'ARTEFACT','CENTER':'CENTRE',
'CIPHER':'CYPHER','DISK':'DISC','FIBER':'FIBRE',
'FULFIL':'FULFILL','METER':'METRE','SAVORY':'SAVOURY',
'TON':'TONNE','TIRE':'TYRE'}

def str_wd_count(my_string, uk=False, hygiene=True):
    us = not uk
    # if the UK flag is True, default to the UK version, else default to the US version
    print("Using the " + ("UK" if uk else "US") + " dictionary for default words")

    # optional hygiene: replace non-alphanumeric characters with spaces for pure word counting
    if hygiene:
        my_string = re.sub(r'[^ \d\w]', ' ', my_string)
        my_string = re.sub(r' +', ' ', my_string)

    # build a list of every word in the text, mapped through the chosen dictionary
    ttl_wds = [ukus.get(w, w) if us else usuk.get(w, w) for w in my_string.upper().split()]
    # tally how often each mapped word occurs
    wd_counts = {}
    for wd in ttl_wds:
        wd_counts[wd] = wd_counts.get(wd, 0) + 1

    return wd_counts

As a sample of use, consider the string

str1 = 'The colour of the dog is not the same as the color of the tire, or is it tyre, I can never tell which one will fulfill'

# Resulting sorted dict.items() With Default Settings
'[(THE,5),(TIRE,2),(COLOR,2),(OF,2),(IS,2),(FULFIL,1),(NEVER,1),(DOG,1),(SAME,1),(IT,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1)]'

# Resulting sorted dict.items() With hygiene=False
'[(THE,5),(COLOR,2),(OF,2),(IS,2),(FULFIL,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1),(IT,1),(TYRE,,1)]'

# Resulting sorted dict.items() With UK Swap, hygiene=True
'[(THE,5),(OF,2),(IS,2),(TYRE,2),(COLOUR,2),(WHICH,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(OR,1),(WILL,1),(AS,1),(CAN,1),(TELL,1),(NOT,1),(FULFILL,1),(ONE,1),(IT,1)]'

# Resulting sorted dict.items() With UK Swap, hygiene=False
'[(THE,5),(OF,2),(IS,2),(COLOUR,2),(ONE,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(FULFILL,1),(TYRE,,1),(IT,1),(OR,1)]'

You can use the resulting dictionary of word counts in any way you'd like, and if you need the original string with the modifications added it is easy enough to modify the function to also return that.
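
If the goal is only to look up how often one search word occurs, whichever spelling the user types, one option is to map both the text and the search word to a single canonical spelling before counting. The sketch below assumes a one-way map with a few sample entries; `UK_TO_US`, `canonical` and `count_word` are hypothetical names, not part of the function above.

```python
from collections import Counter

# Hypothetical one-way map (UK -> US); only sample entries for illustration.
UK_TO_US = {'COLOUR': 'COLOR', 'TYRE': 'TIRE'}

def canonical(word):
    """Upper-case the word and map a UK spelling to its US form, if known."""
    w = word.upper()
    return UK_TO_US.get(w, w)

def count_word(text, word):
    """Return how often word occurs in text, counting UK and US spellings together."""
    counts = Counter(canonical(w) for w in text.split())
    return counts[canonical(word)]  # Counter returns 0 for words not present
```

With this approach a single one-way dictionary is enough, because the search word goes through the same mapping as the text.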

dblclik