
This is my string:

mystring = "How’s it going?"

This is what I did:

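import string
exclude = set(string.punctuation)

def strip_punctuations(mystring):
    # Drop every ASCII punctuation character in a single pass.
    new_string = ''.join(ch for ch in mystring if ch not in exclude)
    # "\xe2\x80\x99" is the UTF-8 byte sequence for the curly apostrophe.
    new_string = new_string.replace("\xe2\x80\x99", "")
    # "\xc2\xa0" is the UTF-8 byte sequence for a non-breaking space.
    new_string = new_string.replace("\xc2\xa0\xc2\xa0", "")
    return new_string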

OUTPUT:

If I do not include the line new_string = new_string.replace("\xe2\x80\x99","") this is the output:

 'How\xe2\x80\x99s it going'

I realized that exclude does not contain that curly apostrophe:

print set(exclude)
set(['!', '#', '"', '%', '$', "'", '&', ')', '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', '<', '?', '>', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~'])

How do I ensure that all such characters are removed, instead of replacing each one manually in the future?

jxn

2 Answers

1

If you are working with long texts such as news articles or scraped web pages, you can use either the "goose" or the "NLTK" Python library. Neither is pre-installed. Here are the links to the libraries: goose, NLTK

Their documentation explains how to clean text with them.

OR

If you don't want to use these libraries, you can build your own "exclude" list manually.
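One way to avoid maintaining such a list by hand is to ask the standard-library unicodedata module for each character's category. The sketch below assumes Python 3 and text that has already been decoded to str (not raw UTF-8 bytes as in the question); the function name strip_punctuation is only illustrative.

```python
import unicodedata

def strip_punctuation(text):
    """Remove every character whose Unicode general category starts with 'P'.

    Assumes `text` is a decoded str, not a UTF-8 byte string.
    """
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )

mystring = "How\u2019s it going?"   # \u2019 is the curly apostrophe
print(strip_punctuation(mystring))  # Hows it going
```

Note that symbols such as $, + and ~ are in string.punctuation but fall into the Unicode symbol categories (S*), so this sketch leaves them in place; extend the check to 'S' if those should go too.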

Minjun Kim
0
import re

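toReplace = "how's it going?"
# Raw string with the hyphen, brackets and backslash escaped, so the class
# lists exactly the ASCII punctuation characters.
regex = re.compile(r'''[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''')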
newVal = regex.sub('', toReplace)
print(newVal)

The regex matches every character listed in the character class and replaces it with an empty string, which removes it.
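The character class above only lists ASCII punctuation, so the curly apostrophe from the question would still get through. A possible variant, assuming Python 3 and UTF-8 encoded input, is to decode the bytes first and then remove everything that is neither a word character nor whitespace. The helper name clean_bytes is hypothetical.

```python
import re

def clean_bytes(raw):
    """Decode UTF-8 bytes, drop non-word characters, collapse whitespace."""
    text = raw.decode("utf-8")
    # In Python 3, \w and \s are Unicode-aware for str patterns, so the
    # curly apostrophe is removed while letters and digits are kept.
    no_punct = re.sub(r"[^\w\s]", "", text)
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join(no_punct.split())

print(clean_bytes(b"How\xe2\x80\x99s it\xc2\xa0\xc2\xa0going?"))  # Hows it going
```

Underscores count as word characters here, so they are kept; add them to the pattern if they should be removed as well.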

Brunaldo