How to strip punctuation from a string, except apostrophes, for NLP

Question

I am using the following "fastest" way of removing punctuation from a string:

text = file_open.translate(str.maketrans("", "", string.punctuation))

However, it removes all punctuation, including the apostrophes in tokens such as shouldn't, which becomes shouldnt.

The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include such apostrophe-free forms. Instead, it contains the tokens that the NLTK tokenizer would produce when splitting my text. For example, for shouldnt the stopwords included are shouldn, shouldn't, t.

I could either add the extra stopwords or remove the apostrophes from the NLTK stopwords. But neither solution seems "correct" to me, as I think apostrophes should be kept when cleaning punctuation.

Is there a way I can leave the apostrophes when doing fast punctuation cleaning?

I didn't know that was possible, something like this? `string.punctuation.replace(" ' ", "")` – asleniovas Jan 23 '20 at 11:54

score 7 · Accepted Answer · answered Jan 23 '20 at 11:55

>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"

buran

Any way to replace the punctuation with a space or other character? – MaxPi Jun 29 '22 at 20:56
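
On the follow-up about replacing rather than deleting: a minimal sketch, assuming the dict form of `str.maketrans`, where each punctuation mark except the apostrophe is mapped to a replacement string. The helper name `replace_punctuation` is made up for illustration and is not part of the answer above.

```python
from string import punctuation


def replace_punctuation(text, replacement=" "):
    """Replace every punctuation mark except the apostrophe with `replacement`."""
    to_replace = punctuation.replace("'", "")
    # A dict passed to str.maketrans may map a character to any string.
    table = str.maketrans(dict.fromkeys(to_replace, replacement))
    return text.translate(table)


spaced = replace_punctuation("It's right,isn't it?")
print(spaced)
# Replacing with spaces can leave doubled or trailing spaces;
# split/join collapses them if that matters.
print(" ".join(spaced.split()))
```

Passing a different `replacement`, for example `"_"`, swaps in that character instead of a space.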

score 3 · Answer 2 · funie200 · 2020-01-23T12:08:33.540

Edited from this answer.

import re

s = "This is a test string, with punctuation. This shouldn't fail...!"

text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)

This returns:

This is a test string with punctuation This shouldn't fail

Regex explanation:

[^...] matches any character except those listed inside the square brackets
\w matches any word character (equal to [a-zA-Z0-9_] with re.ASCII; by default Python 3 str patterns also match Unicode letters and digits)
\d matches a digit (equal to [0-9] with re.ASCII; already covered by \w, so it is redundant here)
\s matches any whitespace character (equal to [\r\n\t\f\v ] with re.ASCII)
\' matches the character ' literally
+ matches one or more times, as many times as possible, giving back as needed

And you can try it here.
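
A small sketch of the same regex, precompiled and without the `\d` (which `\w` already covers). It also shows what `re.ASCII` changes, since by default `\w` in Python 3 keeps accented letters. The pattern names and the sample string are illustrative only.

```python
import re

# Anything that is not a word character, whitespace, or an apostrophe.
UNICODE_PATTERN = re.compile(r"[^\w\s']+")
# Same pattern, but \w and \s are restricted to their ASCII definitions.
ASCII_PATTERN = re.compile(r"[^\w\s']+", re.ASCII)

s = "Café's menu: crêpes & tea!"

print(UNICODE_PATTERN.sub("", s))  # accented letters are kept
print(ASCII_PATTERN.sub("", s))    # accented letters are stripped as well
```

Note that either way the underscore survives, because it counts as a word character.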

edited Jan 23 '20 at 12:08

answered Jan 23 '20 at 12:03

Thanks for the detailed regex explanation! I've noticed that regex is very popular for solving NLP problems. I will need to do some studying at some point. – asleniovas Jan 23 '20 at 12:13

score 1 · Answer 3 · answered Jan 23 '20 at 11:54

What about using

text = file_open.translate(str.maketrans(",.", "  "))

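and adding any other characters you want replaced by a space to the first string, with one more space in the second string for each (both strings must be the same length)?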
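
A rough sketch of that idea, assuming you keep the character list in a variable so the two arguments cannot drift out of sync (the two-argument form of `str.maketrans` raises `ValueError` when the lengths differ). The names `chars` and `table` and the chosen characters are just examples.

```python
# Characters to turn into spaces (an illustrative choice, extend as needed).
# The apostrophe is deliberately left out so contractions stay intact.
chars = ",.;:!?"

# Build the replacement string from the length of the first one,
# so both arguments always have equal length.
table = str.maketrans(chars, " " * len(chars))

print("Wait... it's here, isn't it?".translate(table))
```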

Znerual
