0

I am trying to strip certain punctuation from my text in Python. Essentially, I made a token counter and am trying to remove all excess punctuation (e.g. quotation marks) that surrounds a word without removing any token-relevant characters (e.g. apostrophes).

I have looked here, here and here for inspiration. However, the proposed solutions do not necessarily address my problem.

For instance, I have strings such as: ''couldn't

where I want to remove the leading '' but keep the ' between n and t.

So far, I have tried str.strip and re.sub, such as:

excludeLine = line.strip(' "\'\t\r\n')

and

excludeLine = re.sub(r'[^\w\s]','',line)

and

excludeLine = re.sub('[%s]' % re.escape(string.punctuation), '', line)

The two re.sub calls not only strip the apostrophe, resulting in couldnt, but also strip other token-relevant punctuation, such as the - in words like state-of-the-art, leaving me with stateoftheart.
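To make the behaviour of the three attempts concrete, here is a small sketch that runs them on a sample sentence. The sentence itself is made up for illustration and does not come from the original data.

```python
import re
import string

# Hypothetical sample line (an assumption, not from the original corpus).
line = "He said ''couldn't' about the 'state-of-the-art' design."

# str.strip only trims characters at the two ends of the whole line,
# so quotes wrapped around words inside the line are untouched.
stripped = line.strip(' "\'\t\r\n')

# Both substitutions delete every punctuation character they match,
# including the apostrophes and hyphens that sit inside a word.
no_punct_a = re.sub(r'[^\w\s]', '', line)
no_punct_b = re.sub('[%s]' % re.escape(string.punctuation), '', line)

print(stripped)
print(no_punct_a)
print(no_punct_b)
```

On this input the strip call changes nothing, while both substitutions collapse couldn't and state-of-the-art into couldnt and stateoftheart.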

Does anyone have a solution that removes only the external, purely syntactic punctuation, such as quotation marks, single quotes, exclamation points and periods, while preserving word-internal apostrophes and hyphens?

EDIT: This is the regex that I am using to extract the individual token strings:

counter.update(x for x in split("[^a-zA-Z']+", line) if x)

Could it be that I need to refine it more?
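One possible refinement, offered as a sketch rather than a definitive fix, is to match tokens directly instead of splitting on non-token characters. It assumes a token is a run of ASCII letters that may contain apostrophes or hyphens only between letters. The names TOKEN and count_tokens are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# A token is one or more letters, optionally followed by groups of
# an apostrophe or hyphen plus more letters. Leading or trailing
# quotes can never be part of a match.
TOKEN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def count_tokens(lines):
    """Count tokens across an iterable of text lines."""
    counter = Counter()
    for line in lines:
        counter.update(TOKEN.findall(line))
    return counter

counts = count_tokens(["''couldn't 'nudge 'obliterating'", "state-of-the-art!"])
print(counts)
```

Because the pattern requires letters on both sides of every apostrophe or hyphen, a plural possessive such as dogs' would lose its trailing apostrophe under this assumption.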

owwoow14

2 Answers

1
excludeLine = re.sub(r'(?!\w.\w)(?:.|^)\K[^\w\s]', '', line)

(if your library supports \K; Python's built-in re module does not, but the third-party regex module does)
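For the built-in re module, a similar effect can be approximated without \K by using lookarounds. The sketch below is not equivalent to the pattern above. It assumes that a punctuation character should be kept only when it has a word character on both sides, and the names EDGE_PUNCT and strip_edge_punct are hypothetical.

```python
import re

# Match a punctuation character (neither word character nor whitespace)
# that lacks a word character immediately before it OR immediately after it.
EDGE_PUNCT = re.compile(r"(?<!\w)[^\w\s]|[^\w\s](?!\w)")

def strip_edge_punct(line):
    """Remove punctuation that is not enclosed by word characters."""
    return EDGE_PUNCT.sub("", line)

print(strip_edge_punct("''couldn't"))
print(strip_edge_punct("state-of-the-art."))
print(strip_edge_punct('"Hello," she said!'))
```

Under this rule the apostrophe in couldn't and the hyphens in state-of-the-art survive, while surrounding quotes and sentence punctuation are dropped. Note that \w also matches digits and the underscore, and that doubled punctuation inside a word (for example a--b) is removed.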

AndreyS Scherbakov
-1
re.sub(u"\u005C[nrt]", r"", YOUR_STRING)
Mayur Koshti
  • Thanks, but it is still giving me a problem. It is not removing all of the single quotes. After implementing your proposal, I still have issues such as: `1 'obliterating'` `1 'ny` `1 'nut` `1 'nudge` `1 'novel'dscribed` `1 'noughtie'`.... etc. – owwoow14 Nov 26 '15 at 11:28