15

I am using Python to clean a given sentence. Suppose that my sentence is:

What's the best way to ensure this?

I want to convert:

What's -> What is

Similarly,

 must've -> must have

Also, verbs to their base form:

told -> tell

Singular to plural, and so on.

I am currently exploring TextBlob, but not all of the above is possible with it.

Cœur
learner
  • You haven't actually asked a question. But if you're asking for a library recommendation, that's off-topic for SO. – PM 2Ring Mar 25 '17 at 15:14

4 Answers

32

For the first question, there isn't a module that does this for you directly, so you will have to build your own. First, you will need a contraction dictionary like this one:

contractions = {
"ain't": "am not / are not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is",
"i'd": "I had / I would",
"i'd've": "I would have",
"i'll": "I shall / I will",
"i'll've": "I shall have / I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

Then write some code to modify your text according to the dictionary, something like this:

text="What's the best way to ensure this?"
for word in text.split():
    if word.lower() in contractions:
        text = text.replace(word, contractions[word.lower()])
print(text)
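
One limitation of the loop above is that `text.split()` leaves punctuation attached to words (so "can't," would not match), and the replacement always comes out in lowercase. A minimal sketch of a regex-based variant follows. It assumes a small subset of the dictionary in which every entry has a single expansion; the `expand_contractions` name and the sample mapping are hypothetical and only for illustration.

```python
import re

# Small illustrative subset (an assumption); in practice, use the full
# dictionary after choosing one expansion for each ambiguous entry.
contractions = {
    "what's": "what is",
    "must've": "must have",
    "can't": "cannot",
    "won't": "will not",
}

# Longest keys first, so a longer contraction is preferred over its prefix.
# Note: \b only works for keys that start and end with a letter.
pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(contractions, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = contractions[word.lower()]
        # Preserve an initial capital letter ("What's" -> "What is").
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return pattern.sub(replace, text)

print(expand_contractions("What's the best way? I can't say, but it must've worked."))
# What is the best way? I cannot say, but it must have worked.
```

Compiling one alternation pattern means the text is scanned once, rather than once per word as in the loop.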

For your second question on changing verb tense, NodeBox's Linguistics library is very popular and highly recommended for such tasks. After downloading the zip file, unzip it and copy it to Python's site-packages directory. You can then write something like this:

import en
for word in text.split():
    if en.is_verb(word.lower()):
        text = text.replace(word, en.verb.present(word.lower()))
print text

Note: this library works only with Python 2; it does not yet support Python 3.

user1211
Taku
  • It seems faster to change all 've to have, 'll to will, 're to are, etc. Or am I missing something? – 0xmax May 04 '18 at 18:30
  • @Totem this is what I do in my answer below – Yann Dubois Feb 04 '19 at 15:32
  • I have the same approach for expanding contractions, but I am stuck on possessives, which also end with 's. For example: "he is driving tom's car". What should the rule be here? The two options I see are to replace 's with an empty string or to expand it by rule, as in tom's -> tom is. How should this type of word be handled? – Satyaaditya May 20 '20 at 06:37
14

The answers above work perfectly well and may be better for ambiguous contractions (although I would argue that there aren't that many ambiguous cases). I would use something more readable and easier to maintain:

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))
# Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.

It might have some flaws I didn't think about though.
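
One such flaw is raised in the comments: the unanchored patterns also fire on quotation marks, so `'my comment'` becomes ` amy comment'`. A possible refinement, sketched here on the assumption that a contraction suffix always follows a letter and ends at a word boundary, is to anchor each pattern. The name `decontracted_anchored` is hypothetical. Possessives such as "Tom's" are still expanded to "Tom is", and the two specific cases do not preserve capitalization.

```python
import re

# (pattern, replacement) pairs: specific cases first, then general suffixes.
# The lookbehind requires a letter before the suffix, and \b requires the
# suffix to end the word, so an opening quote like 'my is left alone.
_RULES = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "can not"),
    (r"(?<=[a-z])n't\b", " not"),
    (r"(?<=[a-z])'re\b", " are"),
    (r"(?<=[a-z])'s\b", " is"),
    (r"(?<=[a-z])'d\b", " would"),
    (r"(?<=[a-z])'ll\b", " will"),
    (r"(?<=[a-z])'ve\b", " have"),
    (r"(?<=[a-z])'m\b", " am"),
]
_COMPILED = [(re.compile(p, re.IGNORECASE), r) for p, r in _RULES]

def decontracted_anchored(phrase):
    # Apply every rule in order to the whole phrase.
    for pattern, replacement in _COMPILED:
        phrase = pattern.sub(replacement, phrase)
    return phrase

print(decontracted_anchored("She said: 'my comment' and I don't mind."))
# She said: 'my comment' and I do not mind.
```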

Yann Dubois
  • +1 I like this approach. One bug I found, which can easily be fixed, is that "can't" gets turned into "ca not"; a possible fix is to add `phrase = re.sub(r"can\'t", "can not", phrase)` under specific – gionni Dec 11 '17 at 09:26
  • @gionni thanks, I must have thought (wrongly) that can't would be handled by this case: `phrase = re.sub(r"\'t", " not", phrase)`. But you are definitely right that it would end up as "ca not" due to the first case. Thanks for pointing it out, I've updated my answer! – Yann Dubois Dec 11 '17 at 16:02
  • I think the following words need to be added to the specifics: ain't, shan't, sha'n't, ma'am, y'all – Harshad Vyawahare Aug 06 '19 at 16:52
  • Doesn't this turn `'my comment'` into `amy comment'`? – 989 Jul 09 '20 at 09:25
  • Indeed, but single quotes should only be used inside double quotes, something like "she said: 'my comment'". That's quite rare, but if you happen to have text with a lot of nested quotes then it's not a good idea. Alternatively, you could use a regex that only replaces contractions outside of quoted blocks "...". – Yann Dubois Jul 09 '20 at 16:09
2

If you want to roll your own, you can use this for contraction mapping:

http://alicebot.blogspot.com/2009/03/english-contractions-and-expansions.html

And this for verb replacements:

http://www.lexically.net/downloads/BNC_wordlists/e_lemma.txt

For the latter, you would probably want to generate a reverse dictionary mapping all the conjugated forms to their original (perhaps keeping in mind that there could be ambiguous forms, so make sure to check for these and handle them properly).
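
A reverse dictionary of that kind could be built along these lines. This is only a sketch: it assumes each entry has the form `lemma -> form1,form2,...` (check the actual file layout before relying on it), and the three sample lines and the `build_reverse_lemma_map` name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample in the assumed "lemma -> form1,form2" layout.
SAMPLE = """\
tell -> tells,telling,told
see -> sees,seeing,saw,seen
saw -> saws,sawing,sawed,sawn
"""

def build_reverse_lemma_map(lines):
    """Map every inflected form to the set of lemmas it can come from."""
    reverse = defaultdict(set)
    for line in lines:
        if "->" not in line:
            continue  # skip blank lines, headers, or anything unexpected
        lemma, forms = line.split("->", 1)
        lemma = lemma.strip()
        reverse[lemma].add(lemma)  # a lemma maps to itself
        for form in forms.split(","):
            form = form.strip()
            if form:
                reverse[form].add(lemma)
    return reverse

reverse = build_reverse_lemma_map(SAMPLE.splitlines())
print(sorted(reverse["told"]))  # ['tell']
print(sorted(reverse["saw"]))   # ['saw', 'see']
```

Storing a set per form keeps ambiguous cases such as "saw" visible, so the caller can decide how to resolve them instead of silently taking whichever lemma was read last.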

Julien
  • 5,243
  • 4
  • 34
  • 35
1

This might not suit your specific problem, but (for general knowledge) there is a great open-source library called spaCy that makes life easier in cases like this. To demonstrate:

import spacy

# Assumes the en_core_web_sm model is installed (python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

texts = ["what's", "must've", "told"]

for text in texts:
    doc = nlp(text)
    lemmatized_text = ' '.join([token.lemma_ for token in doc])
    print(lemmatized_text)

Outputs:

what be
must have
tell