I converted a PDF document to a text file on Ubuntu using
pdftotext -raw /path/to/pdf.pdf /path/to/output.txt
and read the converted file with
sample = open("/path/to/output.txt").read()
Now sample contains undecoded UTF-8 byte sequences such as \xe2\x80\x99, which I want to replace with '' using a regex. I tried these calls:
re.sub(r"""\\\\"""," ",sample)
re.sub(r'\\x..',"",sample)
re.sub(r'\\\\x..'," ",sample)
For example, take this:
abc="CTIinfo@thecoaches.com\n\x0c"
re.sub(r'\\x..',"",abc)
re.sub(r'\\\\x..'," ",abc)
abc.encode("ascii","ignore")
I tested the \\x.. pattern in this online regex tester (with Python selected as the language) and also in this one, and I took the \\\\x.. pattern from this SO question's answer, but both give me CTIinfo@thecoaches.com\n\x0c as output. Neither removes those byte sequences. I don't want to use the pattern \\\w.. because it may also match escape sequences. I also tried abc.encode('utf8'), which throws UnicodeDecodeError. I understand the problem is that \x?? is being read as part of the string, but I don't know how to fix this.
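One way to see why neither pattern matches is to compare a normal string literal with a raw one. This is only a sketch, written for Python 3 (the question's code looks like Python 2, where the same distinction holds); the names raw, unchanged and stripped are illustrative and not from the original code.

```python
import re

# In a normal (non-raw) literal, "\x0c" is ONE character (form feed),
# not the four characters backslash, x, 0, c.
abc = "CTIinfo@thecoaches.com\n\x0c"

# In a raw literal the backslashes are kept as ordinary characters.
raw = r"CTIinfo@thecoaches.com\n\x0c"

# Matches a literal backslash, the letter x, then any two characters.
pattern = r'\\x..'

unchanged = re.sub(pattern, "", abc)   # no backslash in abc, so nothing matches
stripped = re.sub(pattern, "", raw)    # the literal text \x0c is removed

print(len("\x0c"))
print(repr(unchanged))
print(repr(stripped))
```

The pattern only has something to match when the backslash is really present in the data, as in the raw literal; in abc the form feed is a single character and has to be matched by its value instead.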
If you want to test a solution, please use these strings:
182\nWheel of Life, 24\xe2\x80\x9325, 135\xe2\x80\x93136
\n194\xe2\x80\x93195
CTI\xe2\x80\x99s\ntraining enables participants
80\xe2\x80\x9383
The expected output for those test strings is:
182\nWheel of Life, 2425, 135136
\n194195
CTIs\ntraining enables participants
8083
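As a possible starting point, the test strings above can be run through a byte-level pattern that keeps only printable ASCII plus tab, newline and carriage return. This is a sketch under assumptions: it is written for Python 3 and treats the data as bytes, and the names NON_PRINTABLE and strip_non_printable are hypothetical helpers, not part of the original code.

```python
import re

# Hypothetical helper: match every run of bytes that is NOT printable ASCII,
# while keeping tab (\x09), newline (\x0a) and carriage return (\x0d).
NON_PRINTABLE = re.compile(rb'[^\x09\x0a\x0d\x20-\x7e]+')

def strip_non_printable(data):
    """Return data with the matched bytes removed."""
    return NON_PRINTABLE.sub(b'', data)

tests = [
    b"182\nWheel of Life, 24\xe2\x80\x9325, 135\xe2\x80\x93136",
    b"\n194\xe2\x80\x93195",
    b"CTI\xe2\x80\x99s\ntraining enables participants",
    b"80\xe2\x80\x9383",
]

cleaned = [strip_non_printable(t) for t in tests]
for line in cleaned:
    print(repr(line))
```

In Python 3 the file would need to be opened in binary mode, for example open("/path/to/output.txt", "rb").read(), for the bytes pattern to apply. Because the character class excludes control characters as well, a form feed (\x0c) is removed by the same pattern.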
Note:
I've also tried:
abc=abc.decode("utf-8")
abc=abc.encode("ascii","ignore")
This removes some characters, but I can still see sequences like \x0c (form feed), so I want a regex-only way to remove them.
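The form feed survives the decode/encode round trip because it is itself an ASCII character (code 12), so the "ignore" handler never sees it as an error. The sketch below illustrates this in Python 3 with a hypothetical sample string; CONTROL_CHARS is an assumed name for a pattern that removes control characters other than tab, newline and carriage return.

```python
import re

# Hypothetical sample: a UTF-8 right single quote (\xe2\x80\x99) and a form feed.
data = b"CTI\xe2\x80\x99s\ntraining\x0c"

# Decoding as UTF-8 and re-encoding as ASCII drops the non-ASCII quote,
# but the form feed is valid ASCII and is kept.
ascii_only = data.decode("utf-8").encode("ascii", "ignore")
print(repr(ascii_only))

# Control characters except tab (\x09), newline (\x0a), carriage return (\x0d).
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')
cleaned = CONTROL_CHARS.sub('', ascii_only.decode("ascii"))
print(repr(cleaned))
```

Matching the control characters by value, rather than by the text of their escape sequence, is what lets the regex find them.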
Regular expressions I have tried:
abc="CTIinfo@th\x0c\xc0ecoaches.com\n\x0c" #input
re.sub(r'[\\x[a-fA-F0-7]-\\x[a-fA-F0-7]]+',' ',abc)
re.sub(r'[^\x00-\x7F]+',' ',abc)
re.sub(r'\\x..',"",abc)
re.sub(r'\\\\x..'," ",abc)
Please add reasons for downvoting, as it will help me understand my mistakes. The problem may be simple, but I still need a solution. I did a lot of research and experimentation before posting here, and I hope people will value that.