Removing control characters from a string in Python

Question

I currently have the following code:

def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
            i += 1
    return line

This just does not work if there is more than one character to be deleted.

Alex Quinn · Accepted Answer · 2013-11-29T20:23:48.183

183

There are hundreds of control characters in Unicode. If you are sanitizing data from the web or some other source that might contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode general category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C": Cc (control) and Cf (format), alongside Cs (surrogate), Co (private use), and Cn (unassigned).

This snippet removes all control characters from a string.

import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

Examples of unicode categories:

>>> from unicodedata import category
>>> category('\r')      # carriage return --> Cc : control character
'Cc'
>>> category('\0')      # null character ---> Cc : control character
'Cc'
>>> category('\t')      # tab --------------> Cc : control character
'Cc'
>>> category(' ')       # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A')       # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',')       # comma  -----------> Po : punctuation
'Po'
>>>
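
As a minimal sketch of how the category check can be narrowed: the variant below assumes you want to drop only the Cc and Cf categories and keep tabs and newlines. The function name strip_control_chars and its drop/keep parameters are illustrative and not part of unicodedata.

```python
import unicodedata

def strip_control_chars(s, drop=("Cc", "Cf"), keep="\t\n\r"):
    """Remove characters whose category is in `drop`, except those in `keep`."""
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch) not in drop
    )

cleaned = strip_control_chars("a\x00b\tc\u200bd\n")
print(repr(cleaned))  # 'ab\tcd\n'
```

Passing keep="" falls back to removing tabs and newlines as well, and leaving "Cn" out of drop keeps unassigned code points in the string.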

edited Nov 29 '13 at 20:23

answered Sep 25 '13 at 22:17

Alex Quinn


3

Upvoting, as this is the only correct answer for unicode-aware applications. – Will Nov 25 '13 at 15:10
1

Shouldn't the last line be: return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C") ? – jilles de wit Nov 28 '13 at 11:58
2

This is a very reliable solution of removing non-printable characters, thanks! – oski86 Aug 29 '17 at 10:57
1

Should the "Zl" category be included too? It's not clear to me what U+2028 really does, but I just had the misfortune of running into it... – flow2k Jul 27 '19 at 02:15
This is the only correct answer. – Ishan Kumar Oct 21 '21 at 07:01
1

This removes way more than just control characters. Importantly, this also removes characters in the `Cn` category, which are just "unassigned" characters (currently ~75% of available unicode points). Usually, you don't want these in your strings anyway, but in some cases the distinction matters. – Indigenuity Jan 22 '22 at 23:00

SilentGhost · Answer 2 · 2010-12-01T14:03:36.683

30

You could use str.translate with the appropriate map, for example like this:

>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
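
If newlines and tabs should survive, the table can be built to skip them. The sketch below is one way to do that. The names KEEP and CONTROL_TABLE are made up for illustration, and the table also covers DEL and the C1 range (0x7f-0x9f), which range(32) leaves alone.

```python
# Whitespace control characters that should survive (an assumption; adjust as needed).
KEEP = {ord("\t"), ord("\n"), ord("\r")}

# Map every C0 control, DEL, and C1 control code point to None (= delete),
# except the ones listed in KEEP.
CONTROL_TABLE = {
    cp: None
    for cp in list(range(0x20)) + list(range(0x7F, 0xA0))
    if cp not in KEEP
}

print(repr("abc\x02de\n\x7f".translate(CONTROL_TABLE)))  # 'abcde\n'
```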

edited Dec 01 '10 at 14:03

answered Dec 01 '10 at 13:30

SilentGhost


6

I'd suggest not using `map` as a variable name. – Mark Byers Dec 01 '10 at 13:43
3

Note, though, that this nukes newlines. – mlissner May 20 '11 at 07:40
4

This code isn't working. I keep getting `TypeError: expected a character buffer object` error. Python 2.6. – user1476056 Oct 19 '12 at 21:09
5

@user1476056: then you need to use a newer version of Python. The question is clearly tagged `python-3.x` – SilentGhost Oct 24 '12 at 12:09
1

I think this should be `dict.fromkeys(range(33))` since `range` is upper-bound exclusive. – dustinfarris Oct 10 '13 at 14:22
@dustinfarris That would include character 32, which is space. – Glenn Maynard Nov 04 '13 at 20:19
Newline is a control character, if I am not mistaken? – Angry 84 Jan 07 '16 at 16:21
1

Your code doesn't work, and terminates with `expected a character buffer object` on Python 2.7 – user1767754 Jul 07 '17 at 08:35

score 21 · Answer 3 · answered Sep 09 '16 at 16:37

Anyone interested in a regex character class that matches any Unicode control character (category Cc) may use [\x00-\x1f\x7f-\x9f].

You may test it like this:

>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode + 1)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True

So, to remove the control characters using re, use the following:

>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
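
When the same pattern is applied to many strings, it may be worth compiling it once. A small sketch, with CONTROL_RE and strip_cc as illustrative names. Note that it removes tabs and newlines too, since they fall inside \x00-\x1f, and that it leaves Cf characters such as the zero width space alone.

```python
import re

# Matches the C0 controls, DEL, and the C1 controls (category Cc).
CONTROL_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def strip_cc(s):
    """Delete every character matched by CONTROL_RE."""
    return CONTROL_RE.sub("", s)

print(repr(strip_cc("abc\x02de\x85f")))  # 'abcdef'
```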

One difference between this and the first answer is that this only works for `Cc` while that one works for `C*` — hyperknot, Apr 06 '18 at 22:23

cmc · Answer 4 · 2020-07-27T13:02:47.410

18

This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.

pip install regex

import regex as rx

def remove_control_characters(s):
    # Strip every character in the Unicode "Other" (C*) categories from s.
    return rx.sub(r'\p{C}', '', s)

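\p{C} is the Unicode character property for the general category "Other" (C), which covers control (Cc), format (Cf), surrogate (Cs), private-use (Co), and unassigned (Cn) code points, so you can leave it up to the Unicode Consortium which of the more than a million available code points should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of separator (spaces, line separators, and paragraph separators).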

edited Jul 27 '20 at 13:02

answered Jan 16 '19 at 20:57

cmc


2

Agreed. The `regex` library will have up-to-date unicode info, compared to the built-in unicodedata module. – scribu Mar 01 '19 at 11:24
On a side note, I'd argue strongly for avoiding things like `import regex as re` on general principles, particularly when it collides with Python's stdlib. This approach obfuscates dependencies in code, adding an unnecessary confusion node vis-a-vis readability. Sticking with `import regex` and using `regex` within your code makes everything more clear. Aside from that, +1 on this answer. Upvoting. – Chris Larson Jul 26 '20 at 20:15
1

@ChrisLarson I agree! changed. – cmc Jul 27 '20 at 13:03
Can you please write on how to do it for a column of a dataframe? I am trying my_dataframe['column_1'].str.replace(r'\p{C}', ' ', regex=True). It gives error "error: bad escape \p" – Syed Md Ismail Mar 20 '21 at 16:07
@SyedMdIsmail If `my_dataframe['column_1'].str` is your input string, you would use `rx.sub(r'\p{C}', '', my_dataframe['column_1'].str)` to return the cleaned string. – cmc Mar 21 '21 at 19:04

Mark Byers · Answer 5 · 2010-12-01T13:50:36.263

9

Your implementation is wrong because the value of i is incorrect. However, that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n²) instead of O(n). Try this instead:

return ''.join(c for c in line if ord(c) >= 32)
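
To see concretely what goes wrong with the index, the sketch below reproduces the question's loop (renamed broken_remove here, a hypothetical name) next to the join-based version and runs both on a short input.

```python
def broken_remove(line):
    # The loop from the question, unchanged apart from the name.
    i = 0
    for c in line:
        if c < chr(32):
            line = line[:i - 1] + line[i + 1:]
            i += 1  # only advances when a character is removed
    return line

def remove_control_characters(line):
    # Single pass; builds the result once instead of re-slicing.
    return ''.join(c for c in line if ord(c) >= 32)

sample = 'a\x02b'
print(repr(broken_remove(sample)))              # 'a\x02\x02b'
print(repr(remove_control_characters(sample)))  # 'ab'
```

Because i stays at 0 until the first removal, line[:i - 1] becomes line[:-1], so the slice keeps the control character and duplicates it instead of dropping it.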

edited Dec 01 '10 at 13:50

answered Dec 01 '10 at 13:31

Mark Byers


2

at least twice as slow as `str.translate` – SilentGhost Dec 01 '10 at 13:37
1

@ben: it's all well and readable until `ord` chokes on a non-BMP char – SilentGhost Dec 01 '10 at 14:10
Does `ord()` choke on non-BMP chars? `[ord(c) for c in u'\U00020000']` works fine for me, and the values in the resulting list are both >= 32 because they're surrogate pairs. – Ben Hoyt Dec 01 '10 at 14:25
2

Clarification: you're right that `ord(u'\U00020000')` will fail, at least on UCS2 builds of Python, but using `ord(c)` is fine in this case because iterating over the string always gives chars <= 65535. – Ben Hoyt Dec 01 '10 at 14:33

Eric O. Lebigot · Answer 6 · 2013-05-14T02:14:10.397

7

And for Python 2, with the builtin translate:

import string
all_bytes = string.maketrans('', '')  # String of 256 characters with (byte) value 0 to 255

line.translate(all_bytes, all_bytes[:32])  # All bytes < 32 are deleted (the second argument lists the bytes to delete)
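
string.maketrans is not available in current Python 3, but bytes objects still accept a delete argument. A rough equivalent for byte strings, assuming the input is bytes rather than str (the names CONTROL_BYTES, data, and cleaned are illustrative):

```python
# All byte values below 32, i.e. the ASCII C0 control characters.
CONTROL_BYTES = bytes(range(32))

data = b"abc\x02de"
# table=None leaves every byte unchanged; the second argument lists bytes to delete.
cleaned = data.translate(None, CONTROL_BYTES)
print(cleaned)  # b'abcde'
```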

edited May 14 '13 at 02:14

answered Dec 01 '10 at 16:02

Eric O. Lebigot


score 2 · Answer 7 · answered Dec 01 '10 at 13:33

2

You are modifying the line while iterating over it. Try something like ''.join([x for x in line if ord(x) >= 32]) instead.

answered Dec 01 '10 at 13:33

khachik


score 2 · Answer 8 · answered Dec 01 '10 at 15:02

2

import string
''.join(filter(string.printable[:-5].__contains__, line))  # join is needed on Python 3, where filter returns an iterator

answered Dec 01 '10 at 15:02

Kabie


that's limited to the ascii set. – SilentGhost Dec 01 '10 at 15:10

score 0 · Answer 9 · edited Oct 05 '21 at 14:00

0

I tried all of the above and none of it helped. In my case, I had to remove Unicode 'LRM' (left-to-right mark) characters.

Finally, I found this solution, which did the job:

df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')

Reference here.

edited Oct 05 '21 at 14:00

Yun


answered Oct 05 '21 at 10:50

Oded L


score 0 · Answer 10 · answered Apr 13 '23 at 11:11

0

If you only want to remove a specific control character, you can do

line.replace("\x02", "")

Where \x02 is the code of the character, in this case STX (start of text). You can find these codes for example here.
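
For a handful of specific characters, chaining replace calls gets repetitive. One possible alternative is a deletion table built with str.maketrans. The set of characters in UNWANTED below is only an example, and the names UNWANTED and DELETE_TABLE are illustrative.

```python
# Characters to delete: STX, ETX, and the left-to-right mark (example choices).
UNWANTED = "\x02\x03\u200e"

# With three arguments, str.maketrans maps every character in the third to None.
DELETE_TABLE = str.maketrans("", "", UNWANTED)

line = "\x02123.45\u200e\x03"
print(line.translate(DELETE_TABLE))  # 123.45
```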

answered Apr 13 '23 at 11:11

Matthias

