The solution below was provided on Stack Overflow here: expanding-english-language-contractions-in-python

It works great for contractions. I tried to extend it to handle slang words but ran into the issue described below. I'd also prefer a single solution that handles all word conversions (e.g., contractions and slang).

I extended the contractions_dict to also correct slang; see the third entry below:

contractions_dict = {
    "didn't": "did not",
    "don't": "do not",
    "ur": "you are"
}

However, when I apply it to a word that contains the slang term (ur), such as "surprise", I get

"syou areprise"

The "you are" embedded above is where the "ur" used to be.

How do you get an exact match on a key in the contractions_dict?

In my code below I tried embedding a more exact word match regex around the "replace" function but received an error "TypeError: must be str, not function".

The code:

import re

contractions_dict = {
    "didn't": "did not",
    "don't": "do not",
    "ur": "you are"
}

contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

result = expand_contractions("surprise")
print(result)

# The result is "syou areprise".

# ---
# Try to fix it below with a word match regex around the replace function call. 

contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(r'(?:\W|^)'+replace+'(?!\w)', s)

# On the line above I get "TypeError: must be str, not function"

result = expand_contractions("surprise")
print(result)
RandomTask

1 Answer

Your problem is that replace is the name of a function, and you are trying to concatenate it to a string, which is why this

return contractions_re.sub(r'(?:\W|^)'+replace+'(?!\w)', s)

is giving you the error you report. When you call sub() you can supply either a replacement string or a function to call for each match, but you can't combine the two approaches the way you are trying to do.
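To illustrate the two forms of replacement that sub() accepts, here is a minimal sketch. The pattern, the sample sentence, and the names as_string, as_function and concat_failed are made up for the illustration and are not part of the question's code.

```python
import re

# Hypothetical pattern and sample text, for illustration only.
pattern = re.compile("don't")
text = "I don't know"

# Form 1: the replacement is a plain string.
as_string = pattern.sub("do not", text)

# Form 2: the replacement is a function that receives the match object
# and returns the string to substitute.
def replace(match):
    return match.group(0).upper()

as_function = pattern.sub(replace, text)

print(as_string)    # I do not know
print(as_function)  # I DON'T know

# Concatenating a string with the function object itself fails before
# sub() is ever called. The exact wording of the message varies between
# Python versions.
try:
    r'(?:\W|^)' + replace + r'(?!\w)'
    concat_failed = False
except TypeError:
    concat_failed = True

print(concat_failed)  # True
```

The word-matching logic therefore has to go into the compiled pattern, not around the replacement argument.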

I would go back to your original approach of supplying sub() a function. I think what you are missing is the special regex sequence \b. It matches the empty string, but only at word boundaries. Like this:

contractions_re = re.compile("|".join(r'(\b%s\b)' % c for c in contractions_dict.keys()))

This gives the following re pattern:

r"(\bdidn't\b)|(\bdon't\b)|(\bur\b)"

This will avoid nasty syou areprises. Note the r'...' string notation. You need it so that the backslashes won't trip you up.
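Putting the pieces together, a minimal sketch of the full function with word boundaries might look like the following. Two details differ from the one-liner above and are my own assumptions: the keys are passed through re.escape in case a key ever contains a regex metacharacter, and the per-key capture groups are dropped because the replace function only uses match.group(0).

```python
import re

contractions_dict = {
    "didn't": "did not",
    "don't": "do not",
    "ur": "you are",
}

# Wrap every key in \b...\b so it only matches as a whole word.
# re.escape is an added precaution, not part of the original answer.
contractions_re = re.compile(
    "|".join(r"\b%s\b" % re.escape(key) for key in contractions_dict)
)

def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        # match.group(0) is exactly the key that matched.
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

print(expand_contractions("surprise"))              # surprise
print(expand_contractions("ur in for a surprise"))  # you are in for a surprise
```

Note that matching is case-sensitive here, so "Ur" or "Don't" would be left unchanged unless the dictionary or the pattern is adjusted.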

This works with multiple tokens in a string, as it should:

>>> expand_contractions("didn't that surprise you")
'did not that surprise you'

but doing that also shows the limitations of token-by-token substitution of contractions. Beginning a question with "did not that" is very 19th-century (and in fact they probably said "didn't that" even when they wrote "did not that"). The present-day English for that would be "did that not surprise you".

BoarGules