efficiently replace bad characters

Question

I often work with utf-8 text containing characters like:

\xc2\x99

\xc2\x95

\xc2\x85

etc

These characters confuse other libraries I work with, so they need to be replaced.

What is an efficient way to do this, rather than:

text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')

I still use unicode, but there are certain characters that trip up the library that need to be replaced — hoju, Jul 07 '11 at 13:11
I believe you'll want to use `text.translate(table)` as per http://docs.python.org/library/stdtypes.html#str.translate — TryPyPy, Jul 08 '11 at 13:55
@TryPyPy: Make your comment an answer so I can upvote it. You might also want to mention how Python 3+ has [`str.maketrans()`](http://docs.python.org/release/3.1.3/library/stdtypes.html#str.maketrans) as well. — JAB, Jul 08 '11 at 14:06
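The `translate` suggestion from the comments above could look roughly like this in Python 3. This is a minimal sketch, assuming the text has already been decoded to `str`, so that each unwanted character is a single code point (U+0085) and not a two-byte UTF-8 sequence (`\xc2\x85`). The names `REPLACEMENTS`, `TABLE` and `clean` are made up for illustration.

```python
# Hypothetical mapping: keys must be single characters, values may be
# strings of any length (or None to delete the character).
REPLACEMENTS = {
    "\u0085": "...",
    "\u0095": " ",
    "\u0099": " ",
}

# Build the translation table once and reuse it for every string.
TABLE = str.maketrans(REPLACEMENTS)


def clean(text):
    """Return text with every key of REPLACEMENTS swapped for its value."""
    return text.translate(TABLE)


print(repr(clean("wait\u0085 done\u0099")))  # 'wait... done '
```

Because the table is built once, each call is a single pass over the string regardless of how many characters are in the mapping.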

score 38 · Accepted Answer · edited Jul 08 '11 at 13:36

There are always regular expressions; just list all of the offending characters inside square brackets, like so:

import re
print re.sub(r'[\xc2\x99]'," ","Hello\xc2There\x99")

This prints: 'Hello There ', with the unwanted characters replaced by spaces.

Alternatively, if you have a different replacement character for each:

import re

# remove annoying characters
chars = {
    '\xc2\x82' : ',',        # High code comma
    '\xc2\x84' : ',,',       # High code double comma
    '\xc2\x85' : '...',      # Triple dot
    '\xc2\x88' : '^',        # High caret
    '\xc2\x91' : '\x27',     # Forward single quote
    '\xc2\x92' : '\x27',     # Reverse single quote
    '\xc2\x93' : '\x22',     # Forward double quote
    '\xc2\x94' : '\x22',     # Reverse double quote
    '\xc2\x95' : ' ',
    '\xc2\x96' : '-',        # High hyphen
    '\xc2\x97' : '--',       # Double hyphen
    '\xc2\x99' : ' ',
    '\xc2\xa0' : ' ',
    '\xc2\xa6' : '|',        # Split vertical bar
    '\xc2\xab' : '<<',       # Double less than
    '\xc2\xbb' : '>>',       # Double greater than
    '\xc2\xbc' : '1/4',      # one quarter
    '\xc2\xbd' : '1/2',      # one half
    '\xc2\xbe' : '3/4',      # three quarters
    '\xca\xbf' : '\x27',     # c-single quote
    '\xcc\xa8' : '',         # modifier - under curve
    '\xcc\xb1' : ''          # modifier - under line
}
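def replace_chars(match):
    # Look up the replacement for whichever byte sequence matched
    char = match.group(0)
    return chars[char]

# Substitute every listed sequence in one pass over the text
text = re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)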
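For readers on Python 3, the same dictionary-driven substitution might be sketched as below. It assumes the text is a decoded `str`, so the keys are code points instead of UTF-8 byte pairs. `CHARS` is a hypothetical subset of the table above, and `PATTERN` and `replace_bad_chars` are illustrative names that are not part of the original answer.

```python
import re

# Hypothetical subset of the table above, keyed by decoded characters.
CHARS = {
    "\u0085": "...",
    "\u0096": "-",
    "\u0097": "--",
    "\u00a0": " ",
    "\u00bd": "1/2",
}

# Compile once; re.escape guards against keys that are regex metacharacters.
PATTERN = re.compile("|".join(re.escape(key) for key in CHARS))


def replace_bad_chars(text):
    """Replace every key of CHARS found in text with its mapped value."""
    return PATTERN.sub(lambda match: CHARS[match.group(0)], text)


print(replace_bad_chars("1\u00bd\u00a0cups\u0085"))  # 11/2 cups...
```

Compiling the alternation up front avoids rebuilding the pattern on every call, which matters when many documents are processed.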

answered Jul 07 '11 at 11:39 by Nate · edited Jul 08 '11 at 13:36 by hoju

that is a good approach, however we would want to set different replacement characters for each – hoju Jul 07 '11 at 13:15
can you give an example of what you mean? I'd be happy to address a more specific case. – Nate Jul 07 '11 at 13:46
Hi Nate - the downvote was because replacing this way is _not_ what should be done in this case, although the OP has asked for that. (OK, I was bitter, and will de-downvote you.) Python has sophisticated mechanisms to convert encoded strings back and forth, and those should be used. – jsbueno Jul 07 '11 at 14:15
@Steven Rumbalski: You're right, my answer *sure* doesn't apply once he changes the question. – Nate Jul 07 '11 at 17:02
regex would still be possible using a replacement function as second arg instead of a fixed string. That is the approach I was considering, but wanted to get feedback first. – hoju Jul 08 '11 at 00:35
@jsbueno: encoding is not the issue – hoju Jul 08 '11 at 00:37
@Plumo - Ah! Yes, that is a fine solution, now that I understand what you said in your first comment to me. The `repl` function could be something as simple as a lambda that retrieves a replacement from a dictionary, if the replacements are constant. And I was being a little sarcastic before... I guess I just didn't appreciate being randomly downvoted by someone who didn't see fit to provide a "better" answer, just because you were good enough to clarify your question. – Nate Jul 08 '11 at 00:46
For the record I upvoted you. Anyway, I extended your idea to use the replacement function. Hope that is OK. – hoju Jul 08 '11 at 13:38
oh, absolutely fine. Also, I mis-spoke - I didn't mean you downvoted me, I meant Steven. – Nate Jul 08 '11 at 14:04
Refer to https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex for a table mapping these UTF-8 byte sequences to their characters. – BuzzR Jan 16 '19 at 09:03

Gareth Rees · Answer 2 · 2013-09-24T14:50:22.057

I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.

\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?

Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
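The chain of events described above can be reproduced in a few lines of Python 3. This sketch assumes the wrong decoder was ISO 8859-1 (Latin-1), which maps every byte to the code point with the same number. The answer does not name the decoder, so treat that as a guess.

```python
import unicodedata

# Byte 0x95 as it might appear in a Windows-1252 document.
original = b"\x95"

# Decoded correctly, it is U+2022 BULLET.
correct = original.decode("windows-1252")
print(unicodedata.name(correct))  # BULLET

# Decoded as Latin-1 (assumed culprit), it becomes the C1 control U+0095.
wrong = original.decode("latin-1")
print(hex(ord(wrong)))  # 0x95

# Re-encoding the wrong character as UTF-8 gives the bytes from the question.
mojibake = wrong.encode("utf-8")
print(mojibake)  # b'\xc2\x95'
```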

If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)

>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'

If you want to do this correction for all the C1 control characters (U+0080–U+009F) whose code points correspond to valid Windows-1252 characters, you can use this approach:

def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.

    """
    import re
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u009f]', to_windows_1252, s)

For example:

>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'

interesting. The data I am working with is random HTML pages so this seems likely. — hoju, Jul 10 '11 at 14:45
Ah! If you are working with random HTML pages, you need to perform *character encoding auto-detection*. How are you determining the encoding of the pages? (The problem being that very commonly, a page may *say* it's encoded in ISO Latin-1, but actually it's in Windows-1252.) — Gareth Rees, Jul 10 '11 at 14:47

score 12 · Answer 3 · answered Jul 07 '11 at 11:47

If you want to remove all non-ASCII characters from a string, you can use

text.encode("ascii", "ignore")
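A quick Python 3 illustration of what this call does, using a made-up sample string. Note that in Python 3 the result is a `bytes` object, and that every non-ASCII character disappears, including legitimate ones such as accented letters.

```python
# Sample text (hypothetical): an accented letter, a bullet, and a C1 control.
text = "caf\u00e9 \u2022 ok\u0085"

# Characters that cannot be encoded as ASCII are silently dropped.
ascii_bytes = text.encode("ascii", "ignore")
print(ascii_bytes)  # b'caf  ok'

# Decode again if a str is needed for further processing.
ascii_text = ascii_bytes.decode("ascii")
print(repr(ascii_text))  # 'caf  ok'
```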

answered Jul 07 '11 at 11:47 by Tim Pietzcker

Just make sure that `text` is a unicode string - i.e., is defined `text=u"..."` - if not, this raises a `UnicodeDecodeError`. – Nate Jul 07 '11 at 11:50
Also make sure that you don't want to strip down to just ASCII! (goes without saying :p) – 2rs2ts Jul 07 '11 at 12:19

score 2 · Answer 4 · edited Apr 27 '17 at 07:49

import unicodedata

# Decode the UTF-8 byte string to unicode (Python 2)
text_to_unicode = unicode(text, "utf-8")

# Decompose compatibility characters, then drop whatever is still non-ASCII
text_fixed = unicodedata.normalize('NFKD', text_to_unicode).encode('ascii', 'ignore')
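To show what the normalization step buys over a plain `encode('ascii', 'ignore')`, here is a small Python 3 sketch of the same idea. The helper name `to_ascii` is made up for illustration. NFKD splits characters into a base character plus combining marks or compatibility equivalents, so some of the content survives the ASCII encoding step.

```python
import unicodedata


def to_ascii(text):
    """Decompose text with NFKD, then drop anything still outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("caf\u00e9"))        # cafe  (the accent is dropped, the e is kept)
print(to_ascii("wait\u2026"))       # wait...  (ellipsis becomes three full stops)
print(repr(to_ascii("a\u0085b")))   # 'ab'  (C1 controls have no decomposition)
```

Characters with no decomposition, such as the C1 controls from the question, are still removed outright.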

answered Apr 27 '17 at 07:28 by Ady · edited Apr 27 '17 at 07:49 by Tom

A bit more explanation around your answer is always helpful. – Tom Apr 27 '17 at 07:49

score 0 · Answer 5 · edited Nov 22 '16 at 08:18

These characters are not part of the ASCII character set, which is why you are getting the errors. To avoid these errors, you can do the following when reading the file:

import codecs   
f = codecs.open('file.txt', 'r',encoding='utf-8')
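In Python 3 the built-in `open()` accepts the `encoding` argument directly. The sketch below writes a throwaway file (hypothetical name and contents) and reads it back. It also shows that decoding alone does not remove the problematic characters: they come back as C1 control code points.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")  # hypothetical file name

    # Write the UTF-8 byte pair from the question into the file.
    with open(path, "wb") as f:
        f.write(b"item \xc2\x95 one")

    # Read it back as text, decoding from UTF-8.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

# Decoding succeeded, but U+0095 is still present in the string.
print(repr(text))  # 'item \x95 one'
```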

To learn more about these kinds of errors, go through this link.

answered Nov 22 '16 at 07:32 by Mokshith Sandeep · edited Nov 22 '16 at 08:18 by Remi Guan

score 0 · Answer 6 · answered Jul 07 '11 at 14:13

These are not "Unicode characters"; this looks more like a UTF-8 encoded string. (Although your prefix should be \xC3, not \xC2, for most chars.) In 95% of cases you should not just throw them away, unless you are communicating with a COBOL backend. The world is not limited to 26 characters, you know.

There is a concise article explaining the difference between Unicode strings (unicode objects in Python 2, plain strings in Python 3) and encoded byte strings here: http://www.joelonsoftware.com/articles/Unicode.html. Please, for your own sake, do read it. Even if you never plan to handle anything but English in your applications, you will still stumble on symbols like € or º that won't fit in 7-bit ASCII. That article will help you.

That said, maybe the libraries you are using do accept Python unicode objects, and you can transform your UTF-8 Python 2 strings into unicode by doing:

var_unicode = var.decode("utf-8")

If you really need 100% pure ASCII, then after decoding the string to unicode, re-encode it to ASCII, telling the codec to replace characters that don't fit in the charset (each one becomes a ?):

var_ascii = var_unicode.encode("ascii", "replace")
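The second argument to `encode` selects an error handler, and the choice changes the output. A short Python 3 comparison on a made-up sample string:

```python
# Sample text (hypothetical) containing U+2022 BULLET.
text = "a\u2022b"

# "ignore" drops characters that ASCII cannot represent.
print(text.encode("ascii", "ignore"))             # b'ab'

# "replace" substitutes a question mark for each of them.
print(text.encode("ascii", "replace"))            # b'a?b'

# "xmlcharrefreplace" keeps the information as a numeric character reference.
print(text.encode("ascii", "xmlcharrefreplace"))  # b'a&#8226;b'
```

Which handler fits depends on whether the downstream library only needs to avoid crashing or whether the lost characters should remain visible in some form.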

the issue is not unicode vs ascii. The libraries and service I rely on support utf-8 but get tripped up by certain characters. So simply I will remove them because they are not important. — hoju, Jul 08 '11 at 00:32
"The libraries and service I rely on support utf-8 but get tripped up by certain characters." So they don't support UTF-8 itself, they support a subset of UTF-8. — JAB, Jul 08 '11 at 14:05
