Solution for regex of English contractions doesn't work

Asked May 31 '18 at 16:41

Active May 31 '18 at 16:58

I am currently fiddling around with Regular Expressions and NLTK (Natural Language Toolkit). I want to tokenize sentences into words and punctuation. Contractions like "can't", "I'll" and so on should be recognised as words as well. I can't seem to find a regular expression that does this.

\w+(\'\w+)?|[!-~]

Why doesn't this regex work? I only get bad results like:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

For sentences like these:

This is a test. Lulz another sentence. This can't be real.

I'm afraid I haven't understood regular expressions properly.

EDIT:

Code:

import re

re.findall(r"\w+('\w+)?|[!-~]", "This is a test. Lulz another sentence. This can't be real.")

edited May 31 '18 at 16:58

asked May 31 '18 at 16:41

Dwagner

Show the code in which you actually used the regex to get that bad result. – lurker May 31 '18 at 16:43
Use a non-capturing group - `r"\w+(?:'\w+)?|[!-~]"` – Wiktor Stribiżew May 31 '18 at 16:47
You've written `\'w` instead of `'\w`. Why is this closed as a dupe when it's clearly this typo that breaks the regex? @WiktorStribiżew – Norrius May 31 '18 at 16:48
Who has marked this as a duplicate? I don't think the issue lies with the regex function but with the expression. – Dwagner May 31 '18 at 16:49
@Norrius That is OP typo. The only fix that is really necessary is a non-capturing group. – Wiktor Stribiżew May 31 '18 at 16:51
`\w+(\'\w+)?|[!-~]` still doesn't work. The non-capturing group doesn't return the contraction as a word. – Dwagner May 31 '18 at 16:52
@New2HTML Yes, the issue is the capturing group. A non-capturing group syntax is **`(?:...)`**. – Wiktor Stribiżew May 31 '18 at 16:53
Yeah now it works, must have done something wrong. But I thought the `?` after the `(\'\w+)` makes it non-capturing? – Dwagner May 31 '18 at 16:56
@New2HTML No, that makes it optional. – melpomene May 31 '18 at 16:57
Could you remove the duplicate? And what is the difference between optional and non-capturing? I can't wrap my head around this. – Dwagner Jun 01 '18 at 09:27
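
The difference between "optional" and "non-capturing" raised in the comments can be illustrated with a minimal sketch, assuming Python 3's standard `re` module. The variable names are illustrative only. The trailing `?` is a quantifier that makes the group optional, while the leading `?:` inside the parentheses stops the group from capturing. `re.findall` returns the group contents instead of the whole match whenever the pattern contains a capturing group, which is why the original pattern yields mostly empty strings.

```python
import re

sentence = "This is a test. Lulz another sentence. This can't be real."

# Capturing group: findall returns only what group 1 matched,
# which is '' for every match where the optional group was skipped.
capturing = re.findall(r"\w+('\w+)?|[!-~]", sentence)

# Non-capturing group (?:...): no groups in the pattern,
# so findall returns the whole match for each token.
non_capturing = re.findall(r"\w+(?:'\w+)?|[!-~]", sentence)

print(capturing)
print(non_capturing)

# "Optional" is about the trailing ?: the group may match zero times.
optional_hit = re.fullmatch(r"\w+('\w+)?", "can't")
optional_miss = re.fullmatch(r"\w+('\w+)?", "can")
print(optional_hit.group(1))   # the captured suffix
print(optional_miss.group(1))  # None: the group did not participate

# "Non-capturing" is about the leading ?: and is independent of optionality.
# Here the group is non-capturing but required, so "can" alone fails.
required_miss = re.fullmatch(r"\w+(?:'\w+)", "can")
required_hit = re.fullmatch(r"\w+(?:'\w+)", "can't")
print(required_miss)           # None: the suffix is mandatory here
print(required_hit.groups())   # empty tuple: nothing was captured
```

Combining both, as in `(?:'\w+)?`, gives a group that is optional and also leaves `findall` free to return complete tokens such as `can't`.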
