61

The English language has a couple of contractions. For instance:

you've -> you have
he's -> he is

These can sometimes cause headaches when you are doing natural language processing. Is there a Python library that can expand these contractions?

Abdulrahman Bres
Maarten

8 Answers

67

I turned the Wikipedia contraction-to-expansion page into a Python dictionary (see below).

Note that, as you might expect, you will want to use double quotes around the key when querying the dictionary, because the keys themselves contain apostrophes.

Also, I've left in multiple options wherever the Wikipedia page lists them. Feel free to modify the dictionary as you wish. Note that disambiguating to the right expansion would be a tricky problem!

contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}
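As a minimal sketch of how the dictionary might be queried, the snippet below uses a three-entry excerpt so that it runs on its own. The helper name `expansion_options` is hypothetical and not part of the original answer; it only splits the " / "-separated values into a list of candidates and does not attempt any disambiguation.

```python
# Small excerpt of the dictionary above, enough to run this sketch on its own.
contractions = {
    "he's": "he has / he is",
    "you've": "you have",
    "won't": "will not",
}


def expansion_options(word, mapping=contractions):
    """Return the candidate expansions for a contraction (hypothetical helper)."""
    value = mapping.get(word)
    if value is None:
        return []
    # Values with several options are separated by " / " in the dictionary.
    return [option.strip() for option in value.split("/")]


# Double quotes let the apostrophe sit inside the key without escaping.
print(expansion_options("you've"))  # ['you have']
print(expansion_options("he's"))    # ['he has', 'he is']
print(expansion_options("hello"))   # []
```

The lookup is case-sensitive, so keys such as "I'd" in the full dictionary have to be queried with the same capitalization.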
arturomp
32

The answers above will work perfectly well and could be better for ambiguous contractions (although I would argue that there aren't that many ambiguous cases). I would use something more readable and easier to maintain:

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))
# Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.

It might have some flaws I didn't think about though.
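One such flaw is that a blanket "'s" rule also rewrites possessives such as "Amy's house". The following is only a sketch of one possible guard, not part of the function above: it expands "'s" solely after a short, assumed list of host words. The names `IS_HOSTS` and `expand_s` are hypothetical, the list is incomplete, and the "he has" versus "he is" ambiguity is not addressed.

```python
import re

# Words after which "'s" is treated as "is" in this sketch (assumed, incomplete list).
IS_HOSTS = ("he", "she", "it", "that", "there", "what", "where", "who", "how", "here")

# \b anchors the match at a word boundary; the group keeps the host word's casing.
_S_PATTERN = re.compile(r"\b(" + "|".join(IS_HOSTS) + r")'s\b", flags=re.IGNORECASE)


def expand_s(phrase):
    """Expand "'s" to " is" only after the listed words; leave other "'s" untouched."""
    return _S_PATTERN.sub(r"\1 is", phrase)


print(expand_s("That's interesting, but this is Amy's house."))
# That is interesting, but this is Amy's house.
```

The trade-off is that contractions after proper nouns ("Jack's a good swimmer") are left unexpanded, which may or may not be acceptable depending on the task.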

Reposted from my other answer

Yann Dubois
  • 1
    Talking of flaws: as 'real' science --> as areal' science – Arun Jan 30 '21 at 05:12
  • 1
    @Arun Indeed, but single quotes should only be used inside double quotes, as in "she said: 'real' science". That's quite rare, but if you happen to have text with a lot of nested quotes then it's not a good idea. Alternatively, you could use a regex that only replaces contractions outside of quoted blocks "..." – Yann Dubois Jan 30 '21 at 12:34
  • At least for American English. I think British English uses single quotation marks more often. – Yann Dubois Jan 30 '21 at 12:41
  • 2
    Another flaw: "This is Amy's house" -> "This is Amy is house" – raphaelmerx Mar 24 '21 at 03:34
  • 1
    I think backslash ("\") is not needed if "r" exists in the string before the pattern converting it into a raw string. – Shayan Shafiq Apr 01 '21 at 04:44
22

I have found a library for this: contractions. It's very simple.

import contractions
print(contractions.fix("you've"))
print(contractions.fix("he's"))

Output:

you have
he is
Hammad Hassan
  • did you check this library for certain complex contractions mentioned in first answer? – satish silveri Aug 07 '19 at 09:57
  • 3
    Worth noting is that this library doesn't work well with certain special characters, see: https://github.com/kootenpv/contractions/issues/25 – martin36 May 18 '21 at 07:00
  • 1
    @martin36 thanks for the heads-up, however it depends on the dataset and the task, in my case this answer is the solution – Eido95 Jan 30 '22 at 12:07
18

You don't need a library; it is possible to do this with a regular expression, for example.

>>> import re
>>> contractions_dict = {
...     'didn\'t': 'did not',
...     'don\'t': 'do not',
... }
>>> contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
>>> def expand_contractions(s, contractions_dict=contractions_dict):
...     def replace(match):
...         return contractions_dict[match.group(0)]
...     return contractions_re.sub(replace, s)
...
>>> expand_contractions('You don\'t need a library')
'You do not need a library'
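To make the two less obvious steps explicit, here is a small sketch that prints the pattern built by `'(%s)' % '|'.join(...)` and shows what the `replace` callback receives. It reuses the names from the session above; the sample sentence is made up for illustration.

```python
import re

contractions_dict = {
    "didn't": "did not",
    "don't": "do not",
}

# '|'.join(...) glues the keys together with the regex "or" operator,
# and '(%s)' % ... wraps the result in a capturing group.
pattern_text = '(%s)' % '|'.join(contractions_dict.keys())
print(pattern_text)  # (didn't|don't)

contractions_re = re.compile(pattern_text)


def replace(match):
    # sub() calls this once per occurrence with a match object;
    # match.group(0) is the exact text that matched, e.g. "don't".
    return contractions_dict[match.group(0)]


print(contractions_re.sub(replace, "I don't know why it didn't work"))
# I do not know why it did not work
```

If the keys could contain regex metacharacters, passing each key through `re.escape` before joining would be a sensible precaution.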
alko
  • 3
    That's a good start, but I guess there are some edge cases: "Jack's a good swimmer" vs "Jack's house is nice.". – Maarten Nov 06 '13 at 07:58
  • 3
    @Maarten a tool to disambiguate those and other cases won't be a library, but a solution consisting at minimum of a decent PoS tagger and an advanced NLP model, as for example the [parallel corpora approach here](http://www.zora.uzh.ch/47923/4/Volk_Sennrich_Contraction_ResolutionV.pdf), or – alko Nov 06 '13 at 08:21
  • @alko "I'd" can be expanded into 'I would' or 'I had'. How would one handle that? – viki.omega9 Feb 22 '14 at 02:12
  • I didn't understand the `'(%s)' % '|'` part. What exactly is happening there? – the_learning_child Mar 19 '21 at 08:40
  • What will be passed in match parameter? – MAC Jul 09 '21 at 07:53
11

This is a very cool and easy-to-use library for this purpose: https://pypi.python.org/pypi/pycontractions/1.0.1.

Example of use (detailed in link):

from pycontractions import Contractions

# Load your favorite word2vec model
cont = Contractions('GoogleNews-vectors-negative300.bin')

# optional, prevents loading on first expand_texts call
cont.load_models()

out = list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the zoo and she'll be home for dinner."], precise=True))
print(out)

You will also need GoogleNews-vectors-negative300.bin; a download link is given on the pycontractions page linked above. The example code is for Python 3.

Joe9008
4

I would like to add a little to alko's answer here. If you check Wikipedia, the number of English contractions listed there is fewer than 100. Granted, in a real scenario the number could be higher, but I am still pretty sure that 200-300 words are all you will have for English contractions. Now, do you want a separate library just for those? (I don't think what you are looking for actually exists, though.) You can easily solve this problem with a dictionary and a regex. I would recommend using a good tokenizer such as the one in the Natural Language Toolkit (NLTK); the rest you should have no problem implementing yourself.
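A rough sketch of the tokenize-then-look-up idea is shown below. To keep it self-contained it uses a simple regex as a stand-in tokenizer rather than NLTK, and the mapping is a tiny illustrative sample; `CONTRACTIONS`, `TOKEN_RE`, and `expand_tokens` are hypothetical names.

```python
import re

# Tiny illustrative mapping; keys are lowercase.
CONTRACTIONS = {
    "don't": "do not",
    "you've": "you have",
    "i'm": "I am",
}

# Stand-in tokenizer: a word (optionally with inner apostrophes),
# or any single non-space character such as punctuation.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\S")


def expand_tokens(text, mapping=CONTRACTIONS):
    """Tokenize the text and replace each token that is a known contraction."""
    tokens = TOKEN_RE.findall(text)
    # Look tokens up in lowercase; unknown tokens pass through unchanged.
    return [mapping.get(token.lower(), token) for token in tokens]


print(expand_tokens("I'm sure you've seen it, don't worry."))
# ['I am', 'sure', 'you have', 'seen', 'it', ',', 'do not', 'worry', '.']
```

Matching whole tokens avoids accidental substring replacements, but the original capitalization of an expanded token is lost. NLTK's `word_tokenize` splits contractions differently (for example "don't" becomes "do" and "n't"), so the mapping would need to be adapted if that tokenizer were used.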

Jack_of_All_Trades
  • I think this problem is not any more or less difficult than stemming, and there are a couple of libraries for that. Yes, a lot of contractions can be handled with simple search and replace, but some are ambiguous. Most notably "'s". – Maarten Nov 06 '13 at 08:09
1
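import re

# Example mapping only; extend it with the full contraction dictionary (lowercase keys).
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "you've": "you have",
}

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # contraction_mapping is a dictionary mapping contracted forms to their expansions
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Try the exact match first, then fall back to the lowercase form
        expanded_contraction = (contraction_mapping.get(match)
                                or contraction_mapping.get(match.lower()))
        # Keep the capitalization of the first character of the original text
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    # Remove any apostrophes that are left over after expansion
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text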
DavideBrex
Jawwad
0

Even though this is an old question, I figured I might as well answer since there is still no real solution to this as far as I can see.

I had to work on this for a related NLP project, and I decided to tackle the problem since there didn't seem to be anything here. You can check my expander GitHub repository if you are interested.

It's a fairly badly optimized (I think) program based on NLTK, the Stanford CoreNLP models (which you will have to download separately), and the dictionary in the previous answer. All the necessary information should be in the README and the lavishly commented code. I know commented code is dead code, but this is just how I write to keep things clear for myself.

The example input in expander.py are the following sentences:

    ["I won't let you get away with that",  # won't ->  will not
    "I'm a bad person",  # 'm -> am
    "It's his cat anyway",  # 's -> is
    "It's not what you think",  # 's -> is
    "It's a man's world",  # 's -> is and 's possessive
    "Catherine's been thinking about it",  # 's -> has
    "It'll be done",  # 'll -> will
    "Who'd've thought!",  # 'd -> would, 've -> have
    "She said she'd go.",  # she'd -> she would
    "She said she'd gone.",  # she'd -> had
    "Y'all'd've a great time, wouldn't it be so cold!", # Y'all'd've -> You all would have, wouldn't -> would not
    " My name is Jack.",   # No replacements.
    "'Tis questionable whether Ma'am should be going.", # 'Tis -> it is, Ma'am -> madam
    "As history tells, 'twas the night before Christmas.", # 'Twas -> It was
    "Martha, Peter and Christine've been indulging in a menage-à-trois."] # 've -> have

To which the output is

    ["I will not let you get away with that",
    "I am a bad person",
    "It is his cat anyway",
    "It is not what you think",
    "It is a man's world",
    "Catherine has been thinking about it",
    "It will be done",
    "Who would have thought!",
    "She said she would go.",
    "She said she had gone.",
    "You all would have a great time, would not it be so cold!",
    "My name is Jack.",
    "It is questionable whether Madam should be going.",
    "As history tells, it was the night before Christmas.",
    "Martha, Peter and Christine have been indulging in a menage-à-trois."]

So for this small set of test sentences, which I came up with to test some edge cases, it works well.

Since this project has lost importance right now, I am not actively developing it anymore. Any help on this project would be appreciated. Things to be done are written in the TODO list. If you have any tips on how to improve my Python, I would also be very thankful.

Yannick
  • Thanks Yannick. I have a doubt. If your above sentence is in the format like `'I'm a bad person'`. Your method is not applicable. – M S May 09 '19 at 12:59
  • 1
    Well as long as nltk tokenize can split it up into words everything should be fine, but the other answers probably provide better solutions anyway. – Yannick May 10 '19 at 18:31