0

I'm parsing a PDF file whose content I converted to strings, and there are many occurrences of \*** (where * stands for any symbol) inside words. For example:

transaction, a middle ground has seemed workable\xe2\x80\x94norms explicitly articulated, backed by sanctions of the relevant professional associations

Using text.replace("\\***","") obviously does not work, so I was looking into using re.sub().

I'm having trouble with the regular-expression syntax to pass as arguments and was hoping for some help with it.

borrimorri
  • 119
  • 9
  • 4
    Is `*` literally an asterisk or just any symbol? – DYZ Jan 16 '17 at 22:25
  • 1
    Have you tried `text.replace("\\***","")` ? – fafl Jan 16 '17 at 22:27
  • * meaning any symbol @DYZ – borrimorri Jan 16 '17 at 22:29
  • 1
    It's the very epitome of "I thought of using GREP and now I have *two* problems"! – Jongware Jan 16 '17 at 22:29
  • 2
    You do not have `\***` in your string. `\` is the escape character in `\xe2`. You have three consecutive non-ASCII characters. Perhaps that's what you need to remove. – DYZ Jan 16 '17 at 22:32
  • 3
    You are solving the problem the wrong way around. Removing these characters would leave "workablenorms". However, correctly *decoding* them (it's a UTF-8 sequence) would insert an [em dash](http://www.fileformat.info/info/unicode/char/2014/index.htm). – Jongware Jan 16 '17 at 22:33
  • Possible duplicate of [Replace non-ASCII characters with a single space](http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space) – DYZ Jan 16 '17 at 22:38

2 Answers

4

How about text.decode("utf8")? I think that is what you actually want to do.

Or you could strip them out with

text.decode("ascii","ignore") 

(In Python 3, only bytes objects have a decode method, so this works as written if text is bytes; codecs.decode(text, "ascii", "ignore") is an equivalent spelling.)
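
A minimal sketch of the two options above, assuming Python 3 and that the extracted text is available as a bytes object. The name `raw` and the shortened sample are only for illustration:

```python
# Assumption: `raw` stands in for the bytes extracted from the PDF.
raw = b"seemed workable\xe2\x80\x94norms explicitly articulated"

# Decoding as UTF-8 turns the three bytes into one character, U+2014.
decoded = raw.decode("utf-8")

# Ignoring undecodable bytes drops them, which fuses the neighbouring words.
stripped = raw.decode("ascii", "ignore")

print(ascii(decoded))
print(stripped)
```

Decoding keeps the punctuation that separates the two words, while the "ignore" variant silently joins them.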

Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
  • 3
    Ignoring them is a bad idea, because the words on the left and on the right get lumped together. – DYZ Jan 16 '17 at 22:36
  • 2
    I certainly do not disagree ... i just figured ignore was more in line with the original question... – Joran Beasley Jan 16 '17 at 22:38
  • 1
    @AshleyNewman If you find the answer useful, you should consider [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) it – Moinuddin Quadri Jan 16 '17 at 22:53
0

You can use a negated character class ([^...]) to match every non-ASCII character and replace it:

import re
text = re.sub(r'[^\x00-\x7F]', ' ', text)

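If text holds the raw bytes, each of the three non-ASCII bytes is replaced by its own space, so the result will be

'transaction, a middle ground has seemed workable   norms explicitly articulated, backed by sanctions of the relevant professional associations'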
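
How many spaces the substitution leaves depends on whether it runs on bytes or on decoded text. The following is a rough sketch under the assumption of Python 3; `raw` and the shortened sample are hypothetical:

```python
import re

# Assumption: `raw` stands in for the bytes extracted from the PDF.
raw = b"seemed workable\xe2\x80\x94norms"

# On bytes, each of the three bytes matches separately: three spaces.
spaced_bytes = re.sub(rb"[^\x00-\x7F]", b" ", raw)

# On decoded text the dash is one character; `+` also collapses any run
# of non-ASCII characters into a single space.
spaced_text = re.sub(r"[^\x00-\x7F]+", " ", raw.decode("utf-8"))

print(spaced_bytes)
print(spaced_text)
```

Adding `+` to the character class is an optional tweak, not part of the answer's pattern; it keeps the output to one space per run.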
sgDysregulation
  • 4,309
  • 2
  • 23
  • 31