I have a DataFrame with a column called 'cleaned_tweet'. The tweets in this column contain many abbreviations, and I want to replace them with proper English words. For that, I have prepared a dictionary called 'slangs', where each abbreviation is a key and the desired English word or phrase is its value. I want to replace every occurrence of those abbreviations with the corresponding dictionary values.

I have looked at several other solutions on Stack Overflow, but none of them seems to work. I am using a nested for loop and believe I am quite close to the solution, but I'm doing something wrong that I can't figure out.

Here's the nested loop:

for i in range(len(train_test_set)):
    for j in slangs:
        train_test_set['cleaned_tweet'][i] = train_test_set['cleaned_tweet'][i].replace(j, slangs[j])

When I executed this code and ran print(train_test_set['cleaned_tweet'][0]), I got unexpected output like this:

"#mopanthank whyour | hi | years oldwhyour | hi | years oldhesitationospecial editekissas insekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye rainbowhwhy | would whyour | hi | years olduohesitationents | rapper from atalk later | ekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye rainbowhwhy | would whyour | hi | years olduoue loversatileionwhyes | yeah | yes | your | hi | years oldu | team leaderantaonwhysomethingop it | somethingwhyour | hi | years oldupid idiotake careal edwhyour | hi | years olducatekissas insekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye..."

It seems many unwanted values are being appended to the cells. The output size is really big, so I can't copy it all here. Here is the structure of my dataset and dictionary before executing the code:

[Two screenshots: the dataset and the slangs dictionary]

Can someone tell me what I am doing wrong?

Utkarsh Saboo

2 Answers

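You can try using the dictionary with the map() function. Note that map() looks up each whole cell value in the dictionary, so this only works when a cell consists of exactly one abbreviation; values with no matching key become NaN. Something like this: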

slangs = {'abbr1': 'word1'}  # ... plus the remaining abbreviations
# Look up each whole cell value in the dictionary
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].map(slangs)

If you have multiple abbreviations for the same word, you can define the dictionary with the words as keys and lists of their abbreviations as values. You can then invert it so that each abbreviation maps to its word, and follow the same approach. Something like this:

# Define the dictionary with the words as keys and lists of their abbreviations as values
slangs = {'word1': ['abbr11', 'abbr12'], 'word2': ['abbr21', 'abbr22']}  # ... and so on
# Invert slangs so each abbreviation maps to its word: http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in slangs.items() for k in oldv}
# Map with the inverted dictionary d, not the original slangs
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].map(d)
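
As a rough illustration of the inversion step only, here is a small sketch in plain Python. The sample words and abbreviations, and the names slangs_by_word and abbr_to_word, are made up for this example and are not taken from the question.

```python
# Hypothetical sample data: each word maps to a list of its abbreviations.
slangs_by_word = {
    'you': ['u', 'ya'],
    'thanks': ['thx', 'tnx'],
}

# Invert the mapping so that every abbreviation points to its word.
abbr_to_word = {
    abbr: word
    for word, abbrs in slangs_by_word.items()
    for abbr in abbrs
}

print(abbr_to_word)
```

The inverted dictionary is the one that should be used for the lookup, since its keys are the abbreviations.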
Sultan Singh Atwal

I suggest using the Series.str.replace method, which supports a callable as the replacement argument.

First, define a dictionary where keys are the search expressions, and values are the texts to replace with:

slangs = { 'lng1': 'val1', 'lng2': 'val2' }

Then, use

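# Build one alternation pattern from all dictionary keys, wrapped in word boundaries
rx = r'\b(?:{})\b'.format("|".join(slangs.keys()))
# regex=True is required for a callable replacement; each match is looked up in slangs
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].str.replace(rx, lambda x: slangs[x.group()], regex=True)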

Here, the rx will be a dynamically formed regex of a \b(?:abc|def|ghi|...)\b type where \b are word boundaries. This will work if you have search words that are made of letters, digits or underscores. See other variations of this dynamic pattern building to cover more scenarios. Once there is a match found, it's passed to the lambda expression and lambda x: slangs[x.group()] returns the dictionary value for the found key.
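
To see why the word boundaries matter, here is a minimal sketch using only the standard re module on a single string. The sample dictionary and tweet are invented for illustration, and re.escape is an extra precaution that is not part of the snippet above. The chained str.replace loop rewrites letters inside other words and inside earlier replacements, while the single regex pass only touches whole words.

```python
import re

# Hypothetical sample data, not taken from the question.
slangs = {'u': 'you', 'y': 'why', 'thx': 'thanks'}
tweet = 'thx u are my buddy'

# Chained substring replacement: 'u' and 'y' are also replaced inside
# other words and inside text inserted by earlier replacements.
naive = tweet
for abbr, word in slangs.items():
    naive = naive.replace(abbr, word)

# Single pass with word boundaries: only standalone abbreviations match.
# re.escape guards against keys that contain regex metacharacters.
rx = r'\b(?:{})\b'.format('|'.join(re.escape(k) for k in slangs))
fixed = re.sub(rx, lambda m: slangs[m.group()], tweet)

print(naive)
print(fixed)
```

The same pattern and lambda can then be passed to Series.str.replace, which applies the substitution to each cell.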

If you have thousands of dictionary items, use this solution to build the regex trie.

Wiktor Stribiżew