I tried to follow this question to create a regular expression that separates contractions from the word.

Here is my attempt:

 line = re.sub( r'\s|(n\'t)|\'m|(\'ll)|(\'ve)|(\'s)|(\'re)|(\'d)', r" \1",line) #tokenize contractions

However, only the first match is tokenized. For example, "should've can't mustn't we'll" changes to "should ca n't must n't we".

Wiktor Stribiżew
M.A.G

2 Answers

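\1 refers only to the first capturing group, which in your pattern is (n\'t). When any other alternative matches, group 1 is empty, so the match is replaced by just a space.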
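
As a rough illustration, here is a minimal sketch using a shortened, two-alternative version of the pattern (the sample string and the variable name `pattern` are assumptions made for this example). Each parenthesized alternative gets its own group number, so `\1` is empty whenever a different alternative matched:

```python
import re

# Shortened version of the pattern from the question: two separate groups.
pattern = re.compile(r"(n't)|('ve)")

for m in pattern.finditer("can't should've"):
    # m.groups() shows which numbered group took part in each match.
    print(m.group(0), m.groups())

# With \1 in the replacement, 've is replaced by a space plus an empty group 1
# (Python 3.5+ substitutes an empty string for a group that did not match).
result = pattern.sub(r" \1", "can't should've")
print(repr(result))
```

The first match reports group 1 as set and group 2 as None; the second match is the other way around, which is why 've disappears from the result.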

You could put all the options in the same capturing group:

(n\'t|\'m|\'ll|\'ve|\'s|\'re|\'d)
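
Plugged into the original call, this could look like the sketch below. It assumes the `\s` alternative from the question is dropped, since replacing whitespace is not needed for splitting off the contractions; the input string is the one from the question:

```python
import re

line = "should've can't mustn't we'll"
# All alternatives share one capturing group, so \1 always holds the matched contraction.
line = re.sub(r"(n't|'m|'ll|'ve|'s|'re|'d)", r" \1", line)
print(line)  # should 've ca n't must n't we 'll
```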

To go deeper into the topic, I suggest reading Parentheses for Grouping and Capturing.

logi-kal

Another variation without capture groups using the full match \g<0> in the replacement.

The single-character alternatives 'm, 's and 'd can be shortened using a character class: '[msd]

Note that the ' does not have to be escaped when the pattern is wrapped in double quotes.

n't|'(?:ll|[vr]e|[msd])

import re

line = "should've can't mustn't we'll"
line = re.sub(r"n't|'(?:ll|[vr]e|[msd])", r" \g<0>", line)
print(line)

Output

should 've ca n't must n't we 'll
The fourth bird