Remove all possible unwanted characters from a Python string at once

Question

I'm using the Python module newspaper3k to extract an article summary from its URL, like this:

from newspaper import Article
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
print (text)

This gives:

Often hailed as Hollywood\xe2\x80\x99s long standing, commercially successful filmmaker, Spielberg\xe2\x80\x99s lifetime gross, if you include his productions, reaches a mammoth\xc2\xa0$17.2 billion\xc2\xa0\xc2\xad\xe2\x80\x93 unadjusted for inflation.
\r\rThe original\xc2\xa0Jurassic Park\xc2\xa0($983.8 million worldwide), which released in 1993, remains Spielberg\xe2\x80\x99s highest grossing film.
Ready Player One,\xc2\xa0currently advancing at a running total of $476.1 million, has become Spielberg\xe2\x80\x99s seventh highest grossing film of his career.It will eventually supplant Aamir\xe2\x80\x99s 2017 blockbuster\xc2\xa0Dangal\xc2\xa0(1.29 billion yuan) if it achieves the Maoyan\xe2\x80\x99s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.

I want to remove all unwanted characters like \xe2\x80\x99s without chaining multiple replace() calls. I want something like this:

Often hailed as Hollywood long standing, commercially successful filmmaker, 
Spielberg lifetime gross, if you include his productions, reaches a 
mammoth $17.2 billion unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide), 
which released in 1993, remains Spielberg highest grossing film.
Ready Player One,currently advancing at a running total of $476.1 million, 
has become Spielberg seventh highest grossing film of his career.
It will eventually supplant Aamir 2017 blockbuster Dangal (1.29 billion yuan) 
if it achieves the Maoyan lifetime forecast of 1.31 billion yuan ($208 million) in the PRC

Why do you want to avoid replace? If the concern is syntactic, you can use a single regex that removes all substrings of the form \x--. Or is it about time complexity (removing k substrings from a string of length n takes O(n*k) time)? – Aayush Mahajan, Oct 02 '18 at 07:19
Beware: simply removing all non-ASCII characters could produce incorrect text. For example, `'\xc2\xa0'` is the UTF-8 encoding of `'\xa0'`, the Unicode character U+00A0 (NO-BREAK SPACE). Removing it could join two adjacent words... – Serge Ballesta, Oct 02 '18 at 08:06

score 0 · Answer 1 · answered Oct 02 '18 at 07:22


Try using regular expressions:

import re
clear_str = re.sub(r'[\xe2\x80\x99]', '', your_input)  # keep `s` out of the class, or every "s" in the text is removed

re.sub replaces every occurrence of a pattern in your_input with the second argument. A pattern like [abc] matches any single character a, b, or c, so each character listed inside the brackets is removed wherever it appears.
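
To make the difference between a character class and a literal sequence concrete, here is a minimal sketch. The `sample` string is made up for illustration and only modeled on the output in the question; it assumes the text really contains the characters U+00E2, U+0080 and U+0099.

```python
import re

# Hypothetical sample, modeled on the garbled summary in the question.
sample = "Spielberg\xe2\x80\x99s highest grossing film"

# Character class that also lists `s`: every listed character is removed
# wherever it occurs, so ordinary "s" letters disappear as well.
with_class = re.sub(r'[\xe2\x80\x99s]', '', sample)

# Literal sequence: only the exact run \xe2 \x80 \x99 s is removed.
with_sequence = re.sub(r'\xe2\x80\x99s', '', sample)

print(with_class)
print(with_sequence)
```

The first call damages "highest" and "grossing", while the second leaves them intact.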


Slawomir Gorawski


Oh, perhaps these characters are really substrings of your text; in that case, try this pattern: `r'(\xe2|\x80|\x99s)'` – Slawomir Gorawski Oct 02 '18 at 07:41

score 0 · Answer 2 · answered Oct 02 '18 at 07:29


You can use Python's encode/decode to get rid of all non-Latin characters:

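data = text.decode('utf-8')  # assumes text is a bytes object (a Python 2 str)
text = data.encode('latin-1', 'ignore')  # drops characters outside Latin-1; the result is bytes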
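
As a rough illustration of what this does under Python 3, the sketch below starts from a bytes object, since `str` has no `.decode` there. The `raw` value is a hypothetical fragment built from the byte sequences shown in the question, not actual newspaper3k output.

```python
# Hypothetical UTF-8 bytes, modeled on the escapes in the question.
raw = b"mammoth\xc2\xa0$17.2 billion \xe2\x80\x93 Spielberg\xe2\x80\x99s"

data = raw.decode('utf-8')  # proper Unicode text

# Latin-1 covers U+0000..U+00FF, so the no-break space (U+00A0) survives,
# while the en dash (U+2013) and right quote (U+2019) are dropped.
latin = data.encode('latin-1', 'ignore')

# ASCII is stricter: the no-break space is dropped too, joining two words.
ascii_only = data.encode('ascii', 'ignore')

print(latin)
print(ascii_only)
```

Note that the ASCII variant glues "mammoth" to "$17.2", which is the word-joining risk raised in the comments on the question.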


Yossi


score 0 · Answer 3 · answered Oct 02 '18 at 08:01

First use .encode('ascii', errors='ignore') to drop all non-ASCII characters.

If you need this text for something like sentiment analysis, you may also want to remove special characters such as \n and \r. You can do that by first escaping the escape characters and then removing them with a regex.

from newspaper import Article
import re
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
text = text.encode('ascii', errors='ignore')  # bytes, with every non-ASCII character dropped
text = str(text)[2:-1]  # repr turns `\n` into `\\n`; the slice strips the b'...' wrapper (Python 3)
text = re.sub(r'\\.', '', text)  # removes each backslash together with the character after it
print(text)
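
The same steps can be tried offline, without downloading an article. This is only a sketch assuming Python 3, where `str()` on a bytes object returns its repr, including the `b'...'` wrapper that is sliced off here. The `summary` string is a made-up stand-in for `article.summary`.

```python
import re

# Hypothetical stand-in for article.summary, with a no-break space (U+00A0).
summary = "inflation.\r\rThe original\xa0Jurassic Park"

# Step 1: drop every non-ASCII character; the result is a bytes object.
ascii_bytes = summary.encode('ascii', errors='ignore')

# Step 2: take the repr so control characters appear as backslash escapes,
# then slice off the leading b' and the trailing quote.
escaped = str(ascii_bytes)[2:-1]

# Step 3: remove each backslash together with the character that follows it.
cleaned = re.sub(r'\\.', '', escaped)

print(cleaned)
```

Because characters are deleted rather than replaced, the sentence break and the space before "Jurassic" are both lost, so this approach may need a replacement string other than `''` if word boundaries matter.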

Mark Tolonen · Answer 4 · 2018-10-02T08:39:08.063

The article was decoded incorrectly. The website likely specified the wrong encoding, but without a valid URL in the question to reproduce the output, that is difficult to prove.

The escape codes indicate that UTF-8 was the correct encoding, so use the following to encode back to bytes directly (latin1 is a 1:1 mapping from the first 256 Unicode code points to bytes), then decode as UTF-8:

text = text.encode('latin1').decode('utf8')

Result:

Often hailed as Hollywood’s long standing, commercially successful filmmaker, Spielberg’s lifetime gross, if you include his productions, reaches a mammoth $17.2 billion – unadjusted for inflation.

The original Jurassic Park ($983.8 million worldwide), which released in 1993, remains Spielberg’s highest grossing film. Ready Player One, currently advancing at a running total of $476.1 million, has become Spielberg’s seventh highest grossing film of his career.It will eventually supplant Aamir’s 2017 blockbuster Dangal (1.29 billion yuan) if it achieves the Maoyan’s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.
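
A self-contained sketch of this round trip follows, using a shortened fragment of the text from the question as `garbled`. The final mapping to plain ASCII is an optional extra and purely an assumption about what the asker might want; `ascii_map` is a hypothetical name, and its entries cover only the characters in this sample.

```python
# Fragment of the mis-decoded summary from the question (shortened).
garbled = ("Hollywood\xe2\x80\x99s lifetime gross reaches a mammoth"
           "\xc2\xa0$17.2 billion\xc2\xa0\xc2\xad\xe2\x80\x93 unadjusted")

# latin1 maps code points 0-255 straight back to the original bytes,
# which are then decoded with the encoding they were written in.
repaired = garbled.encode('latin1').decode('utf8')

# Optional and hypothetical: swap typographic characters for ASCII ones
# instead of deleting them, so that words are not joined together.
ascii_map = str.maketrans({
    '\u2019': "'",   # right single quotation mark
    '\u2013': '-',   # en dash
    '\u00a0': ' ',   # no-break space
    '\u00ad': None,  # soft hyphen, removed entirely
})
plain = repaired.translate(ascii_map)

print(repaired)
print(plain)
```

If the string is not mojibake of this kind, `encode('latin1')` or `decode('utf8')` may raise an error, so this is worth wrapping in a `try` block when the input is uncertain.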
