Python regex: tokenizing English contractions

Question

I am trying to parse strings so that all word components are separated out, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].

The nltk module does not seem to be up to the task, however, because:

"I wouldn't've done that."

tokenizes as:

['I', "wouldn't", "'ve", 'done', 'that', '.']

where the desired tokenization of "wouldn't've" was: ['would', "n't", "'ve"]

After examining common English contractions, I am trying to write a regex to do the job but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:

n't, 've, 'd, 'll, 's, 'm, 're

But the token "'ve" can also follow other contractions such as:

'd've, n't've, and (conceivably) 'll've

At the moment, I am trying to wrangle this regex:

\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b

However, this pattern also matches the badly formed:

"wouldn't've've"

It seems the problem is that the third apostrophe qualifies as a word boundary so that the final "'ve" token matches the whole regex.

I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice for alternative strategies.

Also, I am curious if there is any way to include the word boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace and there doesn't seem to be a way around this.
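A minimal sketch of that behaviour, using only the standard `re` module: inside a character class `\b` is read as a backspace, so a boundary that should also exclude apostrophes has to be expressed some other way, for instance with lookarounds. The names `backspace_hits` and `apostrophe_aware` and the sample strings are illustrative assumptions, not part of the question.

```python
import re

# Inside [...] the escape \b means the backspace character (\x08), not a word boundary.
backspace_hits = re.findall(r"[\b]", "a\x08b")

# Hypothetical workaround: lookarounds that treat an apostrophe like a word character,
# so a match may not be preceded or followed by a word character or an apostrophe.
apostrophe_aware = re.compile(r"(?<![\w'])[a-zA-Z]+(?![\w'])")

plain_words = apostrophe_aware.findall("I wouldn't've done that.")
print(backspace_hits)  # ['\x08']
print(plain_words)     # ['I', 'done', 'that']
```

With these lookarounds every piece of "wouldn't've" is rejected, because each piece touches either a letter or an apostrophe.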

EDIT:

Here's the output:

>>> import re
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've've done that.")
>>> print(matches)
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]

I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
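One way to see where the third match comes from is to print the matched text instead of the groups. The sketch below assumes the input contains the extra "'ve". It suggests that the top-level `|` splits the pattern into two independent alternatives, so `('s|'m|'re|'ve)\b` can match with no leading letters at all. The `grouped` pattern is only an illustrative variant, not the asker's final regex.

```python
import re

pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
text = "She'll wish she hadn't've've done that."

# group(0) is the full matched text; the third match has no leading letters,
# so it can only come from the right-hand side of the top-level "|".
full_matches = [m.group(0) for m in pattern.finditer(text)]
print(full_matches)  # ["She'll", "hadn't've", "'ve"]

# Illustrative variant: both suffix alternatives sit inside one group, so the
# letters are mandatory, and lookarounds stop a match from touching an apostrophe.
grouped = re.compile(r"(?<![\w'])[a-zA-Z]+(?:(?:'d|'ll|n't)(?:'ve)?|'s|'m|'re|'ve)(?![\w'])")
grouped_matches = [m.group(0) for m in grouped.finditer(text)]
print(grouped_matches)  # ["She'll"]
```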

score 3 · Answer 1 · AMDcze · 2015-01-20T21:39:48.133

(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])

EDIT: \2 is the match, \3 is the first group, \4 the second and \5 the third.
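As a rough usage sketch, the group numbering above could be turned into a token list like this. The helper name `split_contraction` and the way the stem is recovered (by trimming the suffix lengths off group 2) are assumptions made for illustration, not part of the answer.

```python
import re

answer_pattern = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])"""
)

def split_contraction(word):
    """Return [stem, suffix, ...] for a well-formed contraction, else None."""
    m = answer_pattern.search(word)
    if m is None:
        return None
    # Groups 3, 4 and 5 hold the suffixes; unused groups are None and are skipped.
    suffixes = [g for g in m.group(3, 4, 5) if g]
    full = m.group(2)
    # The stem is whatever remains of group 2 once the suffixes are trimmed off.
    stem = full[: len(full) - sum(len(s) for s in suffixes)]
    return [stem] + suffixes

print(split_contraction("wouldn't've"))     # ['would', "n't", "'ve"]
print(split_contraction("wouldn't've've"))  # None
```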

edited Jan 20 '15 at 21:39

answered Jan 20 '15 at 20:29

AMDcze

Thanks. But, this gets confused on "She'll wish she hadn't've've done that." and returns a lot of extraneous groups other times. – Schemer Jan 20 '15 at 21:10
Can you provide some examples so we know what to test for? I edited my code so it works with some of my examples and yours. Demo: https://regex101.com/r/iV4cX6/1 – AMDcze Jan 20 '15 at 21:41
Your look ahead/behind assertions pointed me to this: `\b(?<!')[a-zA-Z]+('s|'m|'re|'ve)|(?:('ll|'d|n't)('ve)?)(?!')\b` which solves the task as it is at the moment. The apostrophes _were_ being matched as word boundaries but at the beginning of 've as well as at the end. Also, I might have died and gone to hell before I noticed that mismatched bracket. Thanks! – Schemer Jan 20 '15 at 21:49

score 3 · Answer 2 · answered Jan 20 '15 at 20:33

You can split on the following combined regex:

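import re

# Split on whitespace or on a contraction suffix; each suffix is captured so split() keeps it.
patterns_list = [r"\s", r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
pattern = re.compile('|'.join(patterns_list))
s = "I wouldn't've done that."

# Filter out the empty strings and None entries that split() produces.
print([i for i in pattern.split(s) if i])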

Result:

['I', 'would', "n't", "'ve", 'done', 'that.']

answered Jan 20 '15 at 20:33

Mazdak

Thanks. But this also matches the poorly formed "wouldn't've've" which I would like to ignore. – Schemer Jan 20 '15 at 21:07

score 2 · Answer 3 · answered Jan 21 '15 at 02:28

>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']

so:

>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']

score 1 · Answer 4 · answered Jan 20 '15 at 20:36

You can use this regex to tokenize the text:

(?:(?!.')\w)+|\w?'\w+|[^\s\w]

Usage:

>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
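To make the three alternatives easier to read, the same pattern can be written with `re.VERBOSE`. This is only an annotated restatement of the regex above; the name `tokenizer` and the second test string are illustrative assumptions.

```python
import re

tokenizer = re.compile(r"""
    (?:(?!.')\w)+   # word characters, stopping one character before an apostrophe
  | \w?'\w+         # the character before the apostrophe (if any), the apostrophe, the rest
  | [^\s\w]         # any single character that is neither whitespace nor a word character
""", re.VERBOSE)

tokens = tokenizer.findall("I wouldn't've done that.")
print(tokens)  # ['I', 'would', "n't", "'ve", 'done', 'that', '.']

# The optional \w also absorbs the letter before suffixes other than n't.
print(tokenizer.findall("She'll"))  # ['Sh', "e'll"]
```

The second call hints at a limitation: because the first alternative always stops one character early, suffixes such as 'll or 's take the last letter of the stem with them.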

answered Jan 20 '15 at 20:36

Aran-Fey

Thanks. But this pattern doesn't exclude the poorly formed "wouldn't've've". – Schemer Jan 20 '15 at 20:51

score 0 · Answer 5 · answered Mar 07 '17 at 02:41

Here is a simple one:

text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')
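For reference, the replacement chain can be wrapped in a function so it is runnable on its own. The function name `expand_contractions`, the final `strip()` and the sample sentence are assumptions added for illustration. Note that this approach expands contractions instead of tokenizing them.

```python
def expand_contractions(text):
    """Expand common contraction suffixes into full words (lowercases the input)."""
    # Pad with spaces so a suffix at the very end of the string is followed by a space too.
    text = ' ' + text.lower() + ' '
    text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
        .replace("'s ", ' is ').replace("'m ", ' am ') \
        .replace("'ll ", ' will ').replace("'d ", ' would ') \
        .replace("'re ", ' are ').replace("'ve ", ' have ')
    # Remove the padding added above.
    return text.strip()

expanded = expand_contractions("I won't say she's here")
print(expanded)  # i will not say she is here
```

Because every suffix must be followed by a space, a contraction that is immediately followed by punctuation is left unchanged.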
