I am trying to remove non-printable characters from some string variables that I read in from a text file. If I use the re.sub call below, it doesn't work: the \x.. characters are not removed.

test1 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
test2 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', test1)

But if I take the value from test1 and pass it to re.sub as a "raw" string literal, it works perfectly:

test2 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', r'ing record \xac\xd0\x81\xb4\x02\n2018 Apr')

test2 has 'ing record \n2018 Apr'

I was hoping to easily convert test1 in the first example into a raw string, but from my searching this doesn't seem easy or even possible. I am looking for a solution that lets me use re.sub to remove these characters from a str variable. Alternatively, is there a way to convert my str variable into a raw string first?

UPDATE FIX: I ended up having to do a lot of conversions to remove the unwanted hex codes while keeping my newlines. This works, but I am not sure whether there is a cleaner method out there.

test33 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
test44 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', test33.encode('unicode-escape').decode("utf-8"))
test66 = test44.encode().decode('unicode-escape')
print(test66)

ing record 
2018 Apr
john johnson
  • The raw string `r'\xac'` is four characters. The normal string `'\xac'` is one character. Your regex in the second case matches all the separate characters in the raw string but not when you use `test1` which has the characters defined by the escaped hex values. – Mike Robins Apr 05 '18 at 02:43
  • Ah, I see now, thanks. That also led me to figure out what I needed to do; I edited my post with what your advice led me to :) thx – john johnson Apr 05 '18 at 12:06
  • You may also like to examine the answer to https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python – Mike Robins Apr 06 '18 at 05:20
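
To make the point in the comments above concrete, here is a minimal sketch, assuming Python 3. It shows that the raw literal and the normal literal have different lengths, and that a pattern written against the characters themselves (rather than against their escaped spelling) works directly on test1. The character range used here (printable ASCII plus newline) is an assumption about what should be kept.

```python
import re

raw = r'\xac'      # four characters: backslash, 'x', 'a', 'c'
cooked = '\xac'    # one character: U+00AC

print(len(raw), len(cooked))  # 4 1

test1 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'

# Match the characters themselves: remove anything that is not
# printable ASCII (0x20-0x7e) or a newline.
cleaned = re.sub(r'[^\x20-\x7e\n]', '', test1)
print(repr(cleaned))  # 'ing record \n2018 Apr'
```

No encode/decode round trip is needed, because the pattern never looks for a literal backslash.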

1 Answer

If the only characters you want to keep are printable ASCII, you could try:

import re
import string

test33 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'

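# re.escape stops punctuation in string.printable (such as \ and ]) from being
# parsed as regex syntax; string.printable already contains \n
print(re.sub('[^{0}]'.format(re.escape(string.printable)), '', test33))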

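Or use the Unicode solution provided in: Stripping non printable characters from a string in python (https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python)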
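
For the Unicode route, here is a minimal sketch, assuming Python 3 and the built-in str.isprintable. It is not necessarily what the linked answer does, and the helper name strip_nonprintable is made up for illustration. Newlines are kept explicitly, because str.isprintable treats '\n' as non-printable.

```python
def strip_nonprintable(text, keep='\n'):
    """Drop characters that str.isprintable() rejects, except those in keep."""
    return ''.join(ch for ch in text if ch.isprintable() or ch in keep)

test33 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
result = strip_nonprintable(test33)
print(repr(result))
```

Be aware that this removes only the control characters \x81 and \x02. The characters \xac, \xd0 and \xb4 are printable in Unicode (¬, Ð and ´), so they remain. If those should go too, the ASCII-only approach is the better fit.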

Mike Robins