Tokenizer expanding contractions

Question

I am looking for a tokenizer that expands contractions.

When I use nltk to split a phrase into tokens, the contraction is not expanded:

import nltk

nltk.word_tokenize("she's")
# -> ['she', "'s"]

However, a dictionary of contraction mappings alone ignores the surrounding words, so it cannot decide whether "she's" should be mapped to "she is" or to "she has".
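
To make the ambiguity concrete, here is a minimal sketch of a dictionary-only lookup. The table `CONTRACTIONS` and the helpers `candidates` and `is_ambiguous` are hypothetical names made up for illustration; they are not part of nltk or any other library, and the entries are only examples.

```python
# Hypothetical lookup table; the entries are illustrative, not exhaustive.
CONTRACTIONS = {
    "can't": ["cannot"],
    "won't": ["will not"],
    "she's": ["she is", "she has"],
    "he'd": ["he would", "he had"],
}


def candidates(word):
    """Return every expansion listed for `word`, or the word itself if none."""
    return CONTRACTIONS.get(word.lower(), [word])


def is_ambiguous(word):
    """True when the table alone cannot pick a single expansion."""
    return len(candidates(word)) > 1


for w in ["can't", "she's"]:
    print(w, candidates(w), "ambiguous" if is_ambiguous(w) else "unambiguous")
```

For "can't" the table is enough, but for "she's" it returns two candidates and nothing in the lookup itself says which one fits the sentence.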

Is there a tokenizer that provides contraction expansion?

Would mapping 's to the lemma "be" suit you? — Tiago Duque, Sep 12 '19 at 12:24

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

You can use rule-based matching with spaCy to take the information provided by surrounding words into account. I wrote some demo code below, which you can extend to cover more cases:

import spacy
from spacy.matcher import Matcher

sentences = ["now she's a software engineer" , "she's got a cat", "he's a tennis player", "He thinks that she's 30 years old"]

nlp = spacy.load('en_core_web_sm')

def normalize(sentence):
    doc = nlp(sentence)

    #print([(t.text, t.pos_ , t.dep_) for t in doc])
    # Patterns that decide how a pronoun + 's contraction is expanded.
    # matcher.add(key, [pattern, ...]) is the spaCy v3 signature;
    # older v2 releases use matcher.add(key, None, pattern) instead.
    matcher = Matcher(nlp.vocab)
    matcher.add("case_has", [
        [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}],
        [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}],
    ])
    matcher.add("case_is", [
        [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}],
        [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}],
    ])
    # .. add more cases

    # Map the index of each matched 's token to its expansion.
    expansions = {}
    for match_id, start, end in matcher(doc):
        string_id = nlp.vocab.strings[match_id]
        # The 's token is the second token of every pattern above.
        expansions[start + 1] = "has" if string_id == "case_has" else "is"

    # Emit every token exactly once, replacing only the matched 's tokens,
    # so sentences with several matches (or none) are handled correctly.
    ans = [expansions.get(idx, t.text) for idx, t in enumerate(doc)]
    return ' '.join(ans)

for s in sentences:
    print(s)
    print(normalize(s))
    print()

output:

now she's a software engineer
now she is a software engineer

she's got a cat
she has got a cat

he's a tennis player
he is a tennis player

He thinks that she's 30 years old
He thinks that she is 30 years old
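
The same idea, letting a neighbouring word pick the expansion, can also be sketched without spaCy on a token list like the one `nltk.word_tokenize` returns in the question. This is only a rough illustration under stated assumptions: `PRONOUNS`, `HAS_CUES` and `expand_s` are hypothetical names, the cue words are just the two used in the patterns above, and falling back to "is" when no cue is found is a simplification that will be wrong for cases such as "she's gone".

```python
# Hypothetical word lists; extend them to cover more cases.
PRONOUNS = {"he", "she", "it"}
HAS_CUES = {"got", "been"}  # a following cue word signals "has"


def expand_s(tokens):
    """Expand 's after a pronoun by looking at the next token."""
    out = []
    for i, tok in enumerate(tokens):
        after_pronoun = i > 0 and tokens[i - 1].lower() in PRONOUNS
        if tok.lower() == "'s" and after_pronoun:
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            # Assumption: default to "is" unless a "has" cue follows.
            out.append("has" if nxt in HAS_CUES else "is")
        else:
            # Possessives such as ["John", "'s"] are left untouched.
            out.append(tok)
    return out


print(" ".join(expand_s(["she", "'s", "got", "a", "cat"])))
print(" ".join(expand_s(["he", "'s", "a", "tennis", "player"])))
```

Unlike the POS-based patterns, this version has no tagger to lean on, so it is best seen as a quick baseline for comparing rules rather than a replacement for them.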

Nice. I tried [pycontractions](https://pypi.org/project/pycontractions/) but it fails too often even with big models. — Wiktor Stribiżew, Sep 12 '19 at 12:52
great idea, works out perfectly and solved my problem, thanks a lot! — billie404, Sep 18 '19 at 08:38
