Removing non-token punctuation marks from a string

Question

I am trying to strip certain punctuation from my text in Python. Essentially, I made a token counter and am trying to remove all excess punctuation (e.g. quotation marks) that surrounds a word without removing any relevant token information (e.g. apostrophes).

I have looked here, here and here for inspiration. However, the proposed solutions do not necessarily address my problem.

For instance, I have cases of strings such as: ''couldn't

where I want to remove the '' but not the ' between n and t

So far, I have tried str.strip and re.sub, such as:

excludeLine = line.strip(' "\'\t\r\n')

and

excludeLine = re.sub(r'[^\w\s]','',line)

and

excludeLine = re.sub('[%s]' % re.escape(string.punctuation), '', line)

The substitutions not only strip all of the punctuation, resulting in couldnt, but also strip token-relevant punctuation such as the - in words like state-of-the-art, leaving me with stateoftheart.
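For reference, a minimal sketch of how the three attempts above behave on one sample line. The sample text is made up for illustration; only the three calls come from the question.

```python
import re
import string

# Hypothetical sample line, not taken from the real data.
line = "''couldn't resist the \"state-of-the-art\" demo!"

# str.strip only trims the listed characters at the two ends of the whole line,
# so quotes around words in the middle (and the trailing "!") survive.
stripped = line.strip(' "\'\t\r\n')

# Both substitutions delete every punctuation character, wherever it sits,
# including the apostrophe and the hyphens inside words.
no_punct_a = re.sub(r'[^\w\s]', '', line)
no_punct_b = re.sub('[%s]' % re.escape(string.punctuation), '', line)

print(stripped)
print(no_punct_a)
print(no_punct_b)
```

So strip is too narrow (it never looks inside the line) and the two substitutions are too broad (they cannot tell an inner apostrophe from an outer quote).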

Does anyone have a solution that removes only the external, grammatical punctuation, such as double quotation marks, single quotes, exclamation points, and periods, while preserving the apostrophe, hyphen, etc. inside a word?

EDIT: This is the re that I am using to extract the individual token strings:

counter.update(x for x in split("[^a-zA-Z']+", line) if x)

Could it be that I need to refine it more?
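One possible refinement, as a rough sketch rather than a definitive fix: instead of splitting on what is not a token, match what is a token. The pattern below assumes a token is a run of ASCII letters, optionally joined by single internal apostrophes or hyphens. The names TOKEN_RE and count_tokens and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed token shape: letters, optionally joined by ONE internal ' or -.
# Leading and trailing quotes never match because a token must start
# and end with a letter.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def count_tokens(lines):
    """Count tokens across an iterable of text lines."""
    counter = Counter()
    for line in lines:
        counter.update(TOKEN_RE.findall(line))
    return counter

# Made-up sample input for illustration.
counts = count_tokens([
    "''couldn't be more 'state-of-the-art'!",
    "'novel' ideas, couldn't they?",
])
print(counts)
```

With this approach the surrounding quotes are simply never part of a match, so no separate stripping step is needed. Non-ASCII letters and digits are skipped by this pattern, which may or may not be what you want.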

score 1 · Accepted Answer · answered Nov 26 '15 at 14:44

excludeLine = re.sub(r'(?!\w.\w)(?:.|^)\K[^\w\s]', '', line)

(if your library supports \K; Python's built-in re module does not, but the third-party regex module does)
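If installing another library is not an option, here is a sketch of a different technique that works with the standard re module. It is not a translation of the pattern above. It removes any run of punctuation that lacks a word character on at least one side. The names OUTER_PUNCT_RE and strip_outer_punct are hypothetical.

```python
import re

# A punctuation run is "outer" if no word character precedes it
# or no word character follows it. Punctuation with a word character
# on both sides (couldn't, state-of-the-art) is left alone.
OUTER_PUNCT_RE = re.compile(r"(?<!\w)[^\w\s]+|[^\w\s]+(?!\w)")

def strip_outer_punct(line):
    """Remove punctuation that is not enclosed by word characters."""
    return OUTER_PUNCT_RE.sub("", line)

print(strip_outer_punct("''couldn't"))
print(strip_outer_punct('"state-of-the-art," she said!'))
```

One known limitation of this sketch: doubled punctuation between letters, such as word--word, is removed entirely, because each of the two characters has a non-word neighbour.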

AndreyS Scherbakov

score -1 · Answer 2 · answered Nov 26 '15 at 10:48

re.sub(u"\u005C[nrt]", r"", YOUR_STRING)

Mayur Koshti

Thanks, but it is still giving me a problem. It is not removing all of the single quotes. After implementing your proposal, I still have issues such as: `1 'obliterating'` `1 'ny` `1 'nut` `1 'nudge` `1 'novel'dscribed` `1 'noughtie'`.... etc. – owwoow14 Nov 26 '15 at 11:28
