5

I am using the following "fastest" way of removing punctuation from a string:

import string

text = file_open.translate(str.maketrans("", "", string.punctuation))

However, it removes all punctuation, including apostrophes, so a token such as shouldn't becomes shouldnt.

The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include such apostrophe-free forms. Instead, it contains the tokens that NLTK would generate if I used the NLTK tokenizer to split my text. For example, instead of shouldnt, the stopwords included are shouldn, shouldn't, and t.

I can either add the additional stopwords or remove the apostrophes from the NLTK stopwords. But neither solution seems "correct" to me, as I think apostrophes should be left in place when cleaning punctuation.

Is there a way I can leave the apostrophes when doing fast punctuation cleaning?

asleniovas

3 Answers

7
>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"
buran
3

Edited from this answer.

import re

s = "This is a test string, with punctuation. This shouldn't fail...!"

text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)

This returns:

This is a test string with punctuation This shouldn't fail

Regex explanation:

[^...] matches any single character that is not listed inside the square brackets
\w matches any word character (equal to [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches Unicode letters and digits)
\d matches a digit (equal to [0-9] for ASCII text; already covered by \w)
\s matches any whitespace character (such as [\r\n\t\f\v ])
\' matches the character ' literally
+ matches one or more times, as many times as possible, giving back as needed
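
Two things follow from this breakdown: \d is already covered by \w, and \w includes the underscore, so underscores survive the cleanup. A minimal sketch of a slightly shorter, precompiled variant (PATTERN and clean are illustrative names, not part of the answer above):

```python
import re

# \w already includes digits, so \d is dropped here; the behavior is unchanged.
PATTERN = re.compile(r"[^\w\s']+")


def clean(text):
    """Delete every run of characters that is not a word character, whitespace or '."""
    return PATTERN.sub("", text)


s = "This is a test string, with punctuation. This shouldn't fail...!"
print(clean(s))
# The underscore is a word character, so it is kept.
print(clean("snake_case, it's kept!"))
```

If underscores should be removed as well, they would have to be handled separately, since they cannot be excluded from \w inside the character class.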

And you can try it here.

funie200
    Thanks for the detailed regex explanation! I've noticed that regex is very popular for solving NLP problems. I will need to do some studying at some point. – asleniovas Jan 23 '20 at 12:13
1

What about using

text = file_open.translate(str.maketrans(",.", "  "))

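and adding any other characters you want to ignore to the first string, with one extra space in the second string for each (both arguments must have the same length). Note that this replaces each listed character with a space instead of deleting it.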
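
A possible way to scale this up without keeping two strings in sync is to pass a dict to str.maketrans. The sketch below maps every punctuation character except the apostrophe to a space; to_space and punctuation_to_spaces are hypothetical names, and collapsing the resulting whitespace is an extra step that the line above does not perform.

```python
import string

# str.maketrans also accepts a single dict, so the equal-length
# two-string form is not needed. Every key maps to one space.
to_space = str.maketrans(dict.fromkeys(string.punctuation.replace("'", ""), " "))


def punctuation_to_spaces(text):
    """Replace punctuation (except ') with spaces, then collapse whitespace runs."""
    return " ".join(text.translate(to_space).split())


print(punctuation_to_spaces("well-known,isn't it?"))
```

Replacing with a space rather than deleting keeps words apart when punctuation is the only separator, as in "well-known,isn't".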

Znerual