python remove weird apostrophe and other weird characters not in string.punctuation

Question

This is my string:

mystring = "How’s it going?"

This is what I did:

import string
exclude = set(string.punctuation)

def strip_punctuations(mystring):
    # Drop every ASCII punctuation character listed in string.punctuation.
    new_string = ''.join(ch for ch in mystring if ch not in exclude)
    # Manually remove the UTF-8 bytes of the curly apostrophe (U+2019).
    new_string = new_string.replace("\xe2\x80\x99", "")
    # Manually remove two consecutive UTF-8 encoded non-breaking spaces (U+00A0).
    new_string = new_string.replace("\xc2\xa0\xc2\xa0", "")
    return new_string

OUTPUT:

If I do not include the line new_string = new_string.replace("\xe2\x80\x99", ""), this is the output:

 'How\xe2\x80\x99s it going'

I realized that exclude does not contain that odd-looking apostrophe:

print set(exclude)
set(['!', '#', '"', '%', '$', "'", '&', ')', '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', '<', '?', '>', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~'])

How do I ensure that all such characters are removed, instead of manually replacing each one in the future?

You should not manipulate utf8 strings as bytes. Decode them first. — Daniel, Jun 21 '16 at 17:16
Thanks, this solved it! I just had to use `unicode(mystring)` and then `re.sub(ur"\p{P}+", "", mystring)` — jxn, Jun 21 '16 at 22:19
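
The advice in the comments (decode first, then filter by Unicode property) can be sketched roughly as follows in Python 3. This is only an illustration under a few assumptions: the input arrives as UTF-8 bytes, and the helper name `strip_by_category` is made up here. Note that the standard-library `re` module does not support `\p{P}`; that syntax needs the third-party `regex` package, so this sketch uses `unicodedata.category` from the standard library instead.

```python
import unicodedata

def strip_by_category(text, prefixes=("P", "S")):
    """Drop characters whose Unicode general category starts with one of prefixes.

    "P" covers punctuation (including the curly apostrophe U+2019); "S" covers
    symbols such as "$" and "+", which string.punctuation also contains.
    """
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in prefixes)

raw = b"How\xe2\x80\x99s it going?"   # UTF-8 bytes, as in the question
text = raw.decode("utf-8")            # decode first, then work on characters
print(strip_by_category(text))        # Hows it going
```

Whitespace, including the non-breaking space U+00A0, is left alone by this filter, so it would need separate handling if it should be collapsed or removed.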

score 1 · Answer 1 · edited Feb 13 '21 at 11:00


If you are working with long texts such as news articles or scraped web pages, you can use either the "goose" or the "NLTK" Python library. Neither ships with Python, so both must be installed separately. Here are the links to the libraries: goose, NLTK

You can go through their documentation to learn how to use them.

OR

If you don't want to use these libraries, you may want to create your own "exclude" list manually.
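
A hand-built exclude list might look something like the sketch below (Python 3). The extra characters and the names `EXTRA_CHARS` and `strip_excluded` are assumptions chosen for illustration; the list would have to be extended whenever a new unwanted character turns up.

```python
import string

# Hypothetical additions: curly single and double quotes often seen in scraped text.
EXTRA_CHARS = "\u2018\u2019\u201c\u201d"
exclude = set(string.punctuation) | set(EXTRA_CHARS)

def strip_excluded(text):
    """Remove every character that appears in the exclude set."""
    return "".join(ch for ch in text if ch not in exclude)

print(strip_excluded("How\u2019s it going?"))  # Hows it going
```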

edited Feb 13 '21 at 11:00

DisappointedByUnaccountableMod


answered Jun 21 '16 at 17:19

Minjun Kim


Brunaldo · Answer 2 · 2016-06-21T17:37:31.563

0

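import re
import string

toReplace = "how's it going?"
# Build a character class from string.punctuation; re.escape handles
# characters such as ], \ and - that are special inside a class.
regex = re.compile('[' + re.escape(string.punctuation) + ']')
newVal = regex.sub('', toReplace)
print(newVal)

The regex matches every character in the set from your question and replaces it with an empty string.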
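
A character class built from ASCII punctuation will not catch the curly apostrophe from the question. One possible variation, sketched here for Python 3 where `str` patterns are Unicode-aware by default, is to remove everything that is neither a word character nor whitespace. The names `NON_WORD` and `strip_non_word` are made up for this example, and this is a broader filter than the one above: it also drops symbols and any other non-word character.

```python
import re

# Anything that is neither a word character nor whitespace, plus the
# underscore (which \w would otherwise keep).
NON_WORD = re.compile(r"[^\w\s]|_")

def strip_non_word(text):
    """Remove punctuation-like characters, keeping letters, digits and whitespace."""
    return NON_WORD.sub("", text)

print(strip_non_word("How\u2019s it going?"))  # Hows it going
```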

edited Jun 21 '16 at 17:37

answered Jun 21 '16 at 17:32

Brunaldo

