3

I have a text file from which I have to read a lot of numbers (doubles). It contains ASCII control characters such as DLE and NUL, which are visible in the text file. When I read a line to extract only the doubles/ints, I get errors like "invalid literals \x10". Shown below are the first 2 lines of my file.

DLE NUL NUL NUL [1, 167, 133, 6]DLE NUL NUL   
YS FS NUL[0.0, 4.3025989e-07, 1.5446712e-06, 3.1393029e-06, 5.0430463e-06, 7.1382601e-06

How do I remove all these control characters from a text file at once, using Python? I want this to be done before I parse the file into numbers ...

Any help is appreciated!

atmaere

2 Answers

3

Use string.printable.

>>> import string
>>> filter(string.printable.__contains__, '\x00\x01XYZ\x00\x10')
'XYZ'
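In Python 3, `filter` returns a lazy iterator rather than a string, so the result has to be joined back into a string. Below is a minimal sketch of the same idea, assuming Python 3; the helper name `keep_printable` is made up for illustration. Note that `string.printable` includes the whitespace characters `\t`, `\n`, `\r`, `\x0b` and `\x0c`, so those survive the filter.

```python
import string

# A set gives constant-time membership tests; string.printable itself is a plain str.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character not in string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

print(keep_printable('\x00\x01XYZ\x00\x10'))
```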
falsetru
  • Using regex (see [this answer](http://stackoverflow.com/a/93029/1988505)) is an order of magnitude faster. – Wesley Baugh Nov 07 '14 at 20:31
  • @WesleyBaugh, If speed matters, you can use [`str.translate`](https://docs.python.org/2/library/stdtypes.html#str.translate). – falsetru Nov 08 '14 at 00:21
  • @alvas, How about using `unicode(string.printable)` if you want to use exactly same characters? – falsetru Mar 18 '15 at 12:21
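The `str.translate` suggestion from the comments can be sketched as follows. This assumes Python 3, where `str.translate` takes a mapping from code points to replacements and deletes any character mapped to `None` (the Python 2 method linked above has a different signature). The names `DELETE_CONTROL` and `strip_control` are hypothetical, and this version also deletes `\t`, `\n` and `\r`, so it is meant to be applied to one line at a time.

```python
# Map every code point in 0x00-0x1F to None so that translate() deletes it.
DELETE_CONTROL = dict.fromkeys(range(0x20))

def strip_control(line):
    """Return line with all characters in the range 0x00-0x1F removed."""
    return line.translate(DELETE_CONTROL)

print(strip_control('\x10\x00\x00\x00[1, 167, 133, 6]\x10\x00\x00'))
```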
2

I know this is a very old post, but I am answering because I think it could help others.

I did the following. It replaces every ASCII control character in the range 0x00-0x1F with an empty string.

import re
line = re.sub(r'[\x00-\x1F]+', '', line)

Ref: ASCII (American Standard Code for Information Interchange) Code

Ref: Python re.sub()
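To clean the whole file before parsing, as the question asks, the same substitution can be applied to the full text in one pass. The sketch below is an illustration under a few assumptions: Python 3, a pattern that keeps `\n` so the line structure survives and also drops DEL (0x7F), and a simple regular expression for picking out the numbers. The names `CONTROL_CHARS`, `NUMBER` and `parse_numbers` are hypothetical, and the sample string only imitates the two lines shown in the question.

```python
import re

# Control characters 0x00-0x1F except newline (0x0A), plus DEL (0x7F).
CONTROL_CHARS = re.compile(r'[\x00-\x09\x0b-\x1f\x7f]')
# Integers and floats, with an optional sign and optional exponent.
NUMBER = re.compile(r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?')

def parse_numbers(raw_text):
    """Strip control characters, then return one list of floats per line that has numbers."""
    cleaned = CONTROL_CHARS.sub('', raw_text)
    rows = []
    for line in cleaned.splitlines():
        values = [float(token) for token in NUMBER.findall(line)]
        if values:
            rows.append(values)
    return rows

# Made-up sample imitating the first two lines in the question.
sample = ('\x10\x00\x00\x00[1, 167, 133, 6]\x10\x00\x00\n'
          '\x1c\x00[0.0, 4.3025989e-07, 1.5446712e-06\n')
print(parse_numbers(sample))
```

For a real file, the text could be read with something like `open('data.txt', encoding='latin-1').read()` (the file name is a placeholder); `latin-1` is used here only because it decodes every byte value without raising an error.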

user1012513