1

I converted a PDF document into a text file using pdftotext -raw /path/to/pdf.pdf /path/to/output.txt on Ubuntu, then read the converted file with sample = open("/path/to/output.txt").read(). Now sample contains undecoded byte sequences such as \xe2\x80\x99. I want to remove them with a regex, replacing each with ''. I tried the patterns re.sub(r"""\\\\"""," ",sample), re.sub(r'\\x..',"",sample) and re.sub(r'\\\\x..'," ",sample).

For example, take this:

abc="CTIinfo@thecoaches.com\n\x0c"
re.sub(r'\\x..',"",abc)
re.sub(r'\\\\x..'," ",abc)
abc.encode("ascii","ignore")

I evaluated the \\x.. pattern using this online regex tester (with Python selected as the language) and also this one, and I used the \\\\x.. pattern based on the answer to this SO question, but both give me CTIinfo@thecoaches.com\n\x0c as output. The escaped bytes are not removed. I don't want to use the pattern \\\w.. because it may also match other escape sequences. I also tried abc.encode('utf8'), which throws UnicodeDecodeError. I understand the problem is that \x?? is being read as part of the string, but I don't know how to fix this.
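Why the backslash patterns find nothing can be checked directly. The following is a minimal sketch in Python 3 (the variable names other than abc are only illustrative): in a normal string literal, \x0c is a single form feed character, so the data contains no backslash for r'\\x..' to match. Only a raw string keeps the backslash as text.

```python
import re

abc = "CTIinfo@thecoaches.com\n\x0c"

# "\x0c" in a normal string literal is ONE character (form feed, code point 12).
code_point = ord(abc[-1])

# r'\\x..' looks for a literal backslash, then "x", then any two characters.
# The string holds no backslash at all, so nothing is replaced.
unchanged = re.sub(r'\\x..', "", abc)

# In a raw string the backslash really is part of the data, so the pattern matches.
literal_text = r"CTIinfo@thecoaches.com\n\x0c"
stripped_literal = re.sub(r'\\x..', "", literal_text)

print(code_point)
print(repr(unchanged))
print(repr(stripped_literal))
```

So the pattern is correct for text that literally contains backslash-x sequences, but not for a string that holds the decoded characters themselves.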

If you want to run tests on the solutions, please use these:

182\nWheel of Life, 24\xe2\x80\x9325, 135\xe2\x80\x93136
\n194\xe2\x80\x93195
CTI\xe2\x80\x99s\ntraining enables participants 
80\xe2\x80\x9383

The expected output for those test strings is:

182\nWheel of Life, 2425, 135136
\n194195
CTIs\ntraining enables participants 
8083
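The test strings and expected outputs above can be turned into a small harness. This is only a sketch, assuming Python 3 and that the file content is handled as bytes; CASES and check are hypothetical names introduced here, not part of the original post. As a demonstration, it runs the \\x.. attempt from the question, which fails every case.

```python
import re

# (input, expected) pairs taken from the test strings above, as bytes literals.
# CASES and check() are illustrative names, not from the original post.
CASES = [
    (b"182\nWheel of Life, 24\xe2\x80\x9325, 135\xe2\x80\x93136",
     b"182\nWheel of Life, 2425, 135136"),
    (b"\n194\xe2\x80\x93195", b"\n194195"),
    (b"CTI\xe2\x80\x99s\ntraining enables participants ",
     b"CTIs\ntraining enables participants "),
    (b"80\xe2\x80\x9383", b"8083"),
]


def check(candidate):
    """Return the indices of the test cases that `candidate` gets wrong."""
    return [i for i, (raw, want) in enumerate(CASES) if candidate(raw) != want]


# The attempt from the question: it looks for a literal backslash, finds none,
# and therefore leaves every input unchanged.
failed = check(lambda data: re.sub(rb"\\x..", b"", data))
print(failed)
```

Any proposed solution can be passed to check() as a function from bytes to bytes; an empty list means all four cases pass.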

Note:

I've also tried:

abc=abc.decode("utf-8")
abc=abc.encode("ascii","ignore")

This removes some characters, but I can still see sequences like \x0c, which is a form feed, so I want a regex-only way to remove these strings.
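The behaviour described above follows from the fact that form feed is itself an ASCII character (code point 12), so encoding to ASCII with "ignore" has no reason to drop it. A minimal Python 3 sketch, using a made-up sample value:

```python
# Illustrative sample: a UTF-8 right single quote (\xe2\x80\x99) plus a form feed.
raw = b"CTI\xe2\x80\x99s info\n\x0c"

text = raw.decode("utf-8")                   # the three bytes become one character
ascii_only = text.encode("ascii", "ignore")  # non-ASCII characters are dropped

# The form feed survives because 0x0c is below 0x80, i.e. it is plain ASCII.
form_feed_is_ascii = ord("\x0c") < 128

print(repr(ascii_only))
print(form_feed_is_ascii)
```

Removing \x0c therefore needs a separate step that targets control characters explicitly.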

Tried regular expressions:

abc="CTIinfo@th\x0c\xc0ecoaches.com\n\x0c" #input

re.sub(r'[\\x[a-fA-F0-7]-\\x[a-fA-F0-7]]+',' ',abc)
re.sub(r'[^\x00-\x7F]+',' ',abc)
re.sub(r'\\x..',"",abc)
re.sub(r'\\\\x..'," ",abc)

Please add reasons for downvoting, as it will help me understand my mistakes. The problem may be simple, but a solution is needed. I did a lot of research and experimentation before posting here, and I hope people will value that.

Mani

2 Answers

3

Found the fix. The range \x00-\x7f covers every ASCII character, including everything that can be typed on a keyboard, so re.sub(r'[^\x00-\x7f]+','', abc) replaces every character outside that range with ''.

Non-printable characters such as \f and \v are displayed by the Python interpreter as \x0c and \x0b, whereas some other control characters are displayed as they were written, e.g. \n, \r and \t are displayed as \n, \r and \t. Hence, to remove only \x0c and \x0b (that is, \f and \v) while leaving the other escape sequences and characters alone, the regular expression would be re.sub(r'[\x0b-\x0c]','',(re.sub(r'[^\x00-\x7f]+','', abc))). Alternatively, re.sub(r'[^\x00-\x7f]+','', abc).replace("\f","").replace("\v","") also works.

The second regex removes \x0b and \x0c from the already-cleaned string, and the other non-printable characters are preserved. Chaining str.replace() for \f and \v does the same thing.
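The two variants above can be compared side by side. This is a sketch in Python 3 operating on bytes (an assumption; the original post appears to use Python 2 strings), and the function names are hypothetical. A single-pass character class is included as a possible alternative, not as something the answer itself proposes.

```python
import re


def clean_two_regexes(data: bytes) -> bytes:
    """Drop non-ASCII bytes first, then vertical tab and form feed."""
    without_non_ascii = re.sub(rb"[^\x00-\x7f]+", b"", data)
    return re.sub(rb"[\x0b-\x0c]", b"", without_non_ascii)


def clean_regex_and_replace(data: bytes) -> bytes:
    """Same result, using bytes.replace() for \\f and \\v."""
    return re.sub(rb"[^\x00-\x7f]+", b"", data).replace(b"\f", b"").replace(b"\v", b"")


def clean_single_pass(data: bytes) -> bytes:
    """Alternative: keep only \\x00-\\x0a and \\x0d-\\x7f in one substitution."""
    return re.sub(rb"[^\x00-\x0a\x0d-\x7f]+", b"", data)


abc = b"CTIinfo@th\x0c\xc0ecoaches.com\n\x0c"  # input from the question
cleaned = clean_two_regexes(abc)
print(repr(cleaned))
```

All three functions leave \n, \r and \t in place, which matches the requirement that ordinary escape sequences are not removed.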

Only the display differs: Python's repr() uses the short escapes only for \n, \r, \t and \\, and shows every other control character in \xNN form. \f and \v are the same characters as \x0c and \x0b.

Example:

\f ==> \x0c (form feed)
\v ==> \x0b (vertical tab)
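How Python displays these characters can be checked with repr(). A small Python 3 sketch (the dictionary name is illustrative):

```python
# Map each escape sequence to the form repr() uses when displaying it.
shown_as = {name: repr(ch) for name, ch in [
    ("\\f", "\f"), ("\\v", "\v"), ("\\b", "\b"),
    ("\\n", "\n"), ("\\r", "\r"), ("\\t", "\t"),
]}

for name, shown in shown_as.items():
    print(name, "is displayed as", shown)

# The spelling in source code does not matter: both literals are the same character.
same_character = ("\f" == "\x0c") and ("\v" == "\x0b")
print(same_character)
```

The characters themselves are identical whichever spelling is used in the source; only the printed representation changes.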
Mani
  • 1
    super helpful as I was using the PowerPoint Python module and line feeds were showing up as emojis in my terminal when I was writing them to sqlite. Thank you. – Dean Nov 01 '21 at 20:17
-2

Please see this link: How does \v differ from \x0b or \x0c?

\x is not a separate token. The four characters of an escape such as \x0c form one unit and stand for a single character.

re.sub(r"\x0c","",abc)
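A short Python 3 sketch of why this call works even with a raw pattern string: the pattern text r"\x0c" is four characters long, but the re module interprets \x0c itself as a hex escape for the form feed character. The variable names besides abc are illustrative.

```python
import re

abc = "CTIinfo@thecoaches.com\n\x0c"

# The raw pattern is 4 characters of text; re translates it to "match form feed".
pattern_length = len(r"\x0c")
result_raw = re.sub(r"\x0c", "", abc)

# A non-raw pattern holds the form feed character directly; the effect is the same.
result_plain = re.sub("\x0c", "", abc)

print(pattern_length)
print(repr(result_raw))
```

The newline is left untouched in both cases, since only the form feed is matched.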

Nisar Ahmad