How to remove occurrences of \*** in a string

Question

I'm parsing a PDF file whose content I converted to strings, and there are many occurrences of \*** (where * stands for any symbol) inside words. For example:

transaction, a middle ground has seemed workable\xe2\x80\x94norms explicitly articulated, backed by sanctions of the relevant professional associations

Using text.replace("\\***","") obviously does not work, so I was looking into using re.sub().

I'm having trouble with the regular expression syntax to pass as the arguments and was hoping for some help with it.

It's the very epitome of "I thought of using GREP and now I have *two* problems"! — Jongware, Jan 16 '17 at 22:29
You do not have `\***` in your string. The `\` is the escape character in `\xe2`. You have three consecutive non-ASCII bytes. Perhaps that's what you need to remove. – DYZ, Jan 16 '17 at 22:32
You are solving the problem the wrong way around. Removing these characters would leave "workablenorms". However, correctly *decoding* them (it's a UTF-8 sequence) would insert an [em dash](http://www.fileformat.info/info/unicode/char/2014/index.htm). – Jongware, Jan 16 '17 at 22:33
Possible duplicate of [Replace non-ASCII characters with a single space](http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space) — DYZ, Jan 16 '17 at 22:38

score 4 · Accepted Answer · answered Jan 16 '17 at 22:33

How about text.decode("utf8")? I think that's what you actually want to do.

Or you could strip them out with

text.decode("ascii","ignore")

(In Python 3, str has no decode method, so text would have to be a bytes object; you might need to use codecs.decode(text,"ascii","ignore"), not entirely sure offhand.)
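The two options above can be compared side by side. This is a minimal Python 3 sketch, assuming the PDF text is available as raw bytes; the sample `raw` value is a shortened copy of the question's example, and the variable names are illustrative only.

```python
# Assumption: the extracted PDF text is still a bytes object (hypothetical sample).
raw = b"has seemed workable\xe2\x80\x94norms explicitly articulated"

# Decoding as UTF-8 turns the three bytes into one character, U+2014 (em dash).
decoded = raw.decode("utf-8")

# Ignoring non-ASCII bytes drops them and glues the neighbouring words together.
stripped = raw.decode("ascii", "ignore")

print(decoded)
print(stripped)
```

Decoding keeps the word boundary that the punctuation provided, while the "ignore" variant loses it, which is the concern raised in the comments.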

Joran Beasley

Ignoring them is a bad idea, because the words on the left and on the right get lumped together. – DYZ Jan 16 '17 at 22:36
I certainly do not disagree ... I just figured ignore was more in line with the original question. – Joran Beasley Jan 16 '17 at 22:38
@AshleyNewman If you find the answer useful, you should consider [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) it. – Moinuddin Quadri Jan 16 '17 at 22:53

sgDysregulation · Answer 2 · score 0 · 2017-01-16T22:38:09.357

You can use `^` (negation inside a character class) to match, and so filter out, any non-ASCII character:

import re
text = re.sub(r'[^\x00-\x7F]', ' ', text)

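For the byte string in the question, each of the three non-ASCII bytes is replaced by a space, so the result will be

'transaction, a middle ground has seemed workable   norms explicitly articulated, backed by sanctions of the relevant professional associations'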
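A small variation on the pattern above, sketched for Python 3: adding `+` collapses each run of non-ASCII characters into a single space. The sample strings are assumptions based on the question's example, and the variable names are illustrative only.

```python
import re

# Assumption: `text` is a Python 3 str that was already decoded from UTF-8.
text = "has seemed workable\u2014norms explicitly articulated"

# [^\x00-\x7F] matches any non-ASCII character; + collapses a run into one space.
cleaned = re.sub(r"[^\x00-\x7F]+", " ", text)

# On undecoded bytes, both the pattern and the replacement must be bytes.
raw = text.encode("utf-8")
cleaned_bytes = re.sub(rb"[^\x00-\x7F]+", b" ", raw)

print(cleaned)
print(cleaned_bytes)
```

Either form keeps "workable" and "norms" separated by exactly one space, at the cost of discarding the original punctuation.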

edited Jan 16 '17 at 22:38

answered Jan 16 '17 at 22:36
