I am trying to strip certain punctuation from my text in Python. Essentially, I made a token counter and am trying to remove all excess punctuation (e.g. quotation marks) that surrounds a word without removing any relevant token information (e.g. apostrophes).
I have looked here, here and here for inspiration. However, the proposed solutions do not necessarily address my problem.
For instance, I have strings such as ''couldn't where I want to remove the leading '' but not the ' between n and t.
So far, I have tried str.strip and re, such as:
excludeLine = line.strip(' "\'\t\r\n')
and
excludeLine = re.sub(r'[^\w\s]','',line)
and
excludeLine = re.sub('[%s]' % re.escape(string.punctuation), '', line)
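The two re.sub versions strip all of the punctuation, resulting in couldnt, and also strip token-relevant punctuation such as the - in words such as state-of-the-art, leaving me with stateoftheart. The str.strip version only trims the two ends of the whole line, so quotation marks around words inside the line are left in place.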
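To make the problem reproducible, here is a small sketch of the three attempts run on a made-up sample line. The sample sentence and the variable names are only for illustration and are not taken from my real data.

```python
import re
import string

# Hypothetical sample line, chosen so the quoted word sits in the middle.
line = "He said ''couldn't'' about the state-of-the-art."

# Attempt 1: str.strip only trims the listed characters from the two ends
# of the whole line, so quotes around inner words are untouched.
stripped = line.strip(' "\'\t\r\n')

# Attempt 2: remove every character that is neither a word character
# nor whitespace. This also removes the apostrophe and the hyphens.
no_punct_word_class = re.sub(r'[^\w\s]', '', line)

# Attempt 3: remove every character listed in string.punctuation,
# which likewise includes the apostrophe and the hyphen.
no_punct_string = re.sub('[%s]' % re.escape(string.punctuation), '', line)

print(stripped)
print(no_punct_word_class)
print(no_punct_string)
```

On this sample the first attempt returns the line unchanged, while the other two collapse couldn't and state-of-the-art into couldnt and stateoftheart.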
Does anyone have a solution that removes only the external, syntax/grammar-related punctuation, such as quotation marks, single quotes, exclamation points and periods, while preserving word-internal apostrophes, hyphens, etc.?
EDIT
This is the re pattern that I am using to extract the individual token strings:
counter.update(x for x in split("[^a-zA-Z']+", line) if x)
Could it be that I need to refine it more?
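One possible refinement, as a rough sketch only: instead of splitting on non-token characters, match the tokens directly and allow an apostrophe or hyphen only when it sits between two letters. The names TOKEN_RE and count_tokens are hypothetical, and I am assuming the counter is a collections.Counter.

```python
import re
from collections import Counter

# One or more letters, optionally followed by groups of
# (apostrophe or hyphen, then one or more letters).
# An apostrophe or hyphen is kept only when letters surround it.
TOKEN_RE = re.compile(r"[a-zA-Z]+(?:['-][a-zA-Z]+)*")


def count_tokens(lines):
    """Count tokens across an iterable of text lines."""
    counter = Counter()
    for line in lines:
        counter.update(TOKEN_RE.findall(line))
    return counter


# Hypothetical sample input.
counts = count_tokens(["''couldn't'' be more \"state-of-the-art\"!"])
print(counts)
```

This keeps couldn't and state-of-the-art intact while dropping the surrounding quotes. It is limited to ASCII letters like the original pattern, and it would also drop a trailing apostrophe, as in a plural possessive such as dogs'.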