6

I think I'm going crazy with Python's unicode strings. I'm trying to encode escape characters in a Unicode string without escaping actual Unicode characters. I'm getting this:

In [14]: a = u"Example\n"

In [15]: b = u"Пример\n"

In [16]: print a
Example


In [17]: print b
Пример


In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
\u041f\u0440\u0438\u043c\u0435\u0440\n

while I desperately need (English example works as I want, obviously):

In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
Пример\n

What should I do, short of moving to Python 3?

PS: As pointed out below, I'm actually seeking to escape control characters. Whether I need more than just those will have to be seen.

Nikolai Prokoschenko
  • What characters do you want to encode? Just `\r\n\t`? There is no such thing as an "escape character". – agf Mar 19 '12 at 22:02
  • The thing is, your request is paradoxical. Python 2 strings (Python 3 `bytes`) do not contain Unicode characters. They only contain bytes. These bytes may be Unicode code points stored in a specific encoding, but they're still only bytes. If you want to store Unicode, use `unicode`. If you want bytes, use bytes, but then you don't have Unicode, you just have bytes without the information that it's UTF-*. It might as well be some weird 8-bit codepage. Also see http://nedbatchelder.com/text/unipain.html which provides some insight and general approaches. –  Mar 19 '12 at 22:03
  • @agf Essentially every "special" character. At the very least I would like Python to know that a Unicode codepoint is a letter and leave it alone. – Nikolai Prokoschenko Mar 19 '12 at 22:06
  • @rassie You need to define "special" character. Probably you just need to encode it to utf-8 or whatever and then use a regex. There isn't a standard encoding that does what you want. – agf Mar 19 '12 at 22:11
  • @delnan: I am using `unicode`, as far as I can see (maybe I'm wrong). I honestly don't see why an escaping function escapes Cyrillic letters but doesn't touch Latin ones (I wouldn't complain if it encoded Latin letters too!) My use case: I'm playing with `ast` and want to output strings as they appear in the original code, i.e. I need to be able to output `"Пример\n"` verbatim instead of a string with a line break afterwards. It seems impossible without incomplete hacks, like replacing only a subset of escape sequences. – Nikolai Prokoschenko Mar 19 '12 at 22:13
  • @agf: every escape sequence interpretable by Python. – Nikolai Prokoschenko Mar 19 '12 at 22:14
  • @rassie But those characters aren't stored as escape sequences any more than Cyrillic characters are. There is no difference between the two. – agf Mar 19 '12 at 22:16
  • What are you looking for? `u"Пример\n" == u"\u041f\u0440\u0438\u043c\u0435\u0440\n"` is `True` – Daenyth Mar 19 '12 at 22:33
  • @Daenyth: yes, evaluated they are equal. Their representation is not -- I need cyrillic instead of escaped Unicode codepoints. – Nikolai Prokoschenko Mar 19 '12 at 22:35
  • @rassie: `>>> repr(u"Пример\n") "u'\\u041f\\u0440\\u0438\\u043c\\u0435\\u0440\\n'" >>> repr(u"\u041f\u0440\u0438\u043c\u0435\u0440\n") "u'\\u041f\\u0440\\u0438\\u043c\\u0435\\u0440\\n'" ` -- The repr is equal too because they are the same string. – Daenyth Mar 19 '12 at 22:38
  • What exactly do you need this for? – Karl Knechtel Mar 19 '12 at 22:51
  • What he wants is to go from u"Пример\n" to u"Пример\\n" and back again. This is useful for me because I need to create a TSV without quotes around fields, with one record per line, allowing UTF-8 encoded data in the fields (in a readable way). So I need to escape \t and \n (and therefore \ as well). This is a reasonable request, but I don't have a good solution. – underrun Aug 07 '13 at 19:21

4 Answers

4

Backslash-escaping ASCII control characters in the middle of Unicode data is definitely a useful thing to try to accomplish. The hard part is not only escaping them, but also unescaping them properly when you want the actual character data back.

There should be a way to do this in the Python stdlib, but there is not. I filed a bug report: http://bugs.python.org/issue18679

In the meantime, here's a workaround using translate and some hackery:

tm = dict((k, repr(chr(k))[1:-1]) for k in range(32))
tm[0] = r'\0'
tm[7] = r'\a'
tm[8] = r'\b'
tm[11] = r'\v'
tm[12] = r'\f'
tm[ord('\\')] = '\\\\'

b = u"Пример\n"
c = b.translate(tm)
print(c) ## results in: Пример\n

Every control character that has no single-letter backslash escape is written as a \x## sequence; if you need something different for those, adjust the translation table. The mapping itself loses no information, so it works for me. (One caveat: a NUL written as \0 and immediately followed by a digit will be read back as an octal escape when unescaping.)

Getting the original data back is hacky too, because translate cannot map character sequences back to single characters.

d = c.encode('latin1', 'backslashreplace').decode('unicode_escape')
print(d) ## result in Пример with trailing newline character

You have to encode the string with latin1, which maps each character it knows to a single byte, while using backslashreplace to escape the Unicode characters latin1 doesn't know about. The unicode_escape codec can then reassemble everything the right way.
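
To make the two steps visible, here is a minimal sketch that prints the intermediate byte string. It assumes Python 3, and the variable names are only illustrative.

```python
# Assumes Python 3. The escaped text holds Cyrillic letters followed by
# a literal backslash and the letter 'n' (not a newline).
escaped = "Пример\\n"

# Step 1: latin1 cannot represent Cyrillic, so 'backslashreplace' turns each
# such character into a \uXXXX sequence. ASCII, including the backslash,
# passes through unchanged.
as_bytes = escaped.encode("latin1", "backslashreplace")
print(as_bytes)

# Step 2: unicode_escape interprets every backslash sequence in the bytes,
# both the \uXXXX ones from step 1 and the original \n.
restored = as_bytes.decode("unicode_escape")
print(repr(restored))
```

After step 1 everything is plain ASCII, which is why a single unicode_escape pass can decode both kinds of escapes.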

UPDATE:

I had a case where I needed this to work in both Python 2.7 and Python 3.3. Here's what I did (buried in a _compat.py module):

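# Naming (inferred from usage): u = unicode text, n = native str, b = bytes.
# Caution: the 'surrogateescape' error handler was added in Python 3.1.
# On Python 2 it does not exist, so the calls in the first branch only
# succeed as long as no encoding or decoding error occurs.
if isinstance(b"", str):
    # Python 2: the native str type is a byte string.
    byte_types = (str, bytes, bytearray)
    text_types = (unicode, )
    def uton(x): return x.encode('utf-8', 'surrogateescape')
    def ntob(x): return x
    def ntou(x): return x.decode('utf-8', 'surrogateescape')
    def bton(x): return x
else:
    # Python 3: the native str type is text.
    byte_types = (bytes, bytearray)
    text_types = (str, )
    def uton(x): return x
    def ntob(x): return x.encode('utf-8', 'surrogateescape')
    def ntou(x): return x
    def bton(x): return x.decode('utf-8', 'surrogateescape')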

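escape_tm = dict((k, ntou(repr(chr(k))[1:-1])) for k in range(32))
# The values must be escape sequences (backslash + character), not the
# control characters themselves, hence the doubled backslashes.
escape_tm[0] = u'\\0'
escape_tm[7] = u'\\a'
escape_tm[8] = u'\\b'
escape_tm[11] = u'\\v'
escape_tm[12] = u'\\f'
escape_tm[ord('\\')] = u'\\\\'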

def escape_control(s):
    if isinstance(s, text_types):
        return s.translate(escape_tm)
    else:
        return s.decode('utf-8', 'surrogateescape').translate(escape_tm).encode('utf-8', 'surrogateescape')

def unescape_control(s):
    if isinstance(s, text_types):
        return s.encode('latin1', 'backslashreplace').decode('unicode_escape')
    else:
        return s.decode('utf-8', 'surrogateescape').encode('latin1', 'backslashreplace').decode('unicode_escape').encode('utf-8', 'surrogateescape')
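
As a quick check of the round trip, here is a minimal, self-contained sketch of the text-only path applied to tab-separated fields. It assumes Python 3, drops the bytes handling and the compatibility shims, and uses made-up sample fields.

```python
# Assumes Python 3; text-only version of the two helpers.
# Control characters 0-31 map to their repr() escapes (\n, \t, \x00, ...).
escape_tm = {k: repr(chr(k))[1:-1] for k in range(32)}
# Double literal backslashes so they survive unescaping.
escape_tm[ord("\\")] = "\\\\"

def escape_control(s):
    return s.translate(escape_tm)

def unescape_control(s):
    return s.encode("latin1", "backslashreplace").decode("unicode_escape")

# Illustrative data: each field contains something that would break a TSV line.
fields = ["Пример\n", "tab\there", "back\\slash"]

# One record per line: escaped fields never contain a raw tab or newline.
line = "\t".join(escape_control(f) for f in fields)
print(line)

# Splitting on the real tabs and unescaping restores the original fields.
restored = [unescape_control(f) for f in line.split("\t")]
print(restored == fields)
```

This sketch keeps the repr() form \x00 for NUL rather than \0, which avoids the ambiguity with octal escapes when a digit follows.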
underrun
3

First let's correct the terminology. What you're trying to do is replace "control characters" with an equivalent "escape sequence".

I haven't been able to find any built-in method to do this, and nobody has yet posted one. Fortunately it's not a hard function to write.

control_chars = [unichr(c) for c in range(0x20)] # you may extend this as required

def control_escape(s):
    chars = []
    for c in s:
        if c in control_chars:
            chars.append(c.encode('unicode_escape'))
        else:
            chars.append(c)
    return u''.join(chars)

Or the slightly less readable one-liner version:

def control_escape2(s):
    return u''.join([c.encode('unicode_escape') if c in control_chars else c for c in s])
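
The functions above are written for Python 2. A rough Python 3 equivalent might look like the sketch below: unichr() becomes chr(), and because encode() returns bytes there, each escaped piece is decoded back to text before joining. Using a set for the lookup is my own choice, not part of the original.

```python
# Assumes Python 3.
control_chars = {chr(c) for c in range(0x20)}  # extend as required

def control_escape(s):
    # Escape only the control characters; leave every other character alone.
    return "".join(
        c.encode("unicode_escape").decode("ascii") if c in control_chars else c
        for c in s
    )

print(control_escape("Пример\n"))
```

Note that literal backslashes are not doubled here, so the output cannot be unescaped unambiguously; add the backslash to the escaped set if you need a reversible mapping.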
Mark Ransom
1

The method .encode returns a byte-string (type str in Python 2), so it cannot return unicode characters.

But as there are only a few backslash sequences, you can easily .replace them manually. See http://docs.python.org/reference/lexical_analysis.html#string-literals for a complete list.
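
A minimal sketch of that manual approach, assuming Python 3. The helper name and the replacement table are hypothetical, and the table covers only a small subset of the sequences in the linked list.

```python
# Hypothetical helper; REPLACEMENTS is an illustrative subset, not a full list.
REPLACEMENTS = [
    ("\\", "\\\\"),  # must come first, or the backslashes added below get doubled
    ("\n", "\\n"),
    ("\r", "\\r"),
    ("\t", "\\t"),
]

def escape_manually(s):
    # Apply each replacement in order; non-ASCII letters are never touched.
    for raw, escaped in REPLACEMENTS:
        s = s.replace(raw, escaped)
    return s

print(escape_manually("Пример\n"))
```

The order of the table matters: handling the backslash first keeps the result reversible.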

Leovt
0

.encode('unicode_escape') returns a byte string. You probably want to escape the control characters directly in the Unicode string:

# coding: utf8
import re

def esc(m):
    return u'\\x{:02x}'.format(ord(m.group(0)))

s = u'\r\t\b马克\n'

# Match control characters 0-31.
# The (?s) (DOTALL) flag only affects '.', so it is harmless but not required here.
print re.sub(ur'(?s)[\x00-\x1f]',esc,s)

Output:

\x0d\x09\x08马克\x0a
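
The snippet above is Python 2 only (ur'' literals and the print statement). A rough Python 3 equivalent, as a sketch, could be:

```python
# Assumes Python 3: plain str is Unicode, and print is a function.
import re

def esc(m):
    # Replace the matched control character with a \xNN escape sequence.
    return "\\x{:02x}".format(ord(m.group(0)))

s = "\r\t\b马克\n"

# Match control characters 0-31; everything else is left untouched.
result = re.sub(r"[\x00-\x1f]", esc, s)
print(result)
```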

Note there are other Unicode control characters beyond 0-31, so you may need something more like:

# coding: utf8
import re
import unicodedata as ud

def esc(m):
    c = m.group(0)
    if ud.category(c).startswith('C'):
        return u'\\u{:04x}'.format(ord(c))
    return c

s = u'\rMark\t\b马克\n'

# Match ALL characters so the replacement function
# can test the category.  Not very efficient if the string is long.
print re.sub(ur'(?s).',esc,s)

Output:

\u000dMark\u0009\u0008马克\u000a

You may want finer control of what is considered a control character. There are a number of categories. You could build a regular expression matching a specific type with:

import sys
import re
import unicodedata as ud

# Generate a regular expression that matches any Cc category Unicode character.
Cc_CODES = u'(?s)[' + re.escape(u''.join(unichr(n) for n in range(sys.maxunicode+1) if ud.category(unichr(n)) == 'Cc')) + u']'
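
To show how such a generated pattern might be used, here is a sketch for Python 3, where unichr() is spelled chr(). The names cc_chars and cc_pattern are hypothetical, and the sample string is made up.

```python
# Assumes Python 3.
import re
import sys
import unicodedata as ud

# Collect every code point in the Cc (control) category.
cc_chars = "".join(
    chr(n) for n in range(sys.maxunicode + 1) if ud.category(chr(n)) == "Cc"
)

# Build a character class from them; re.escape protects regex metacharacters.
cc_pattern = re.compile("[" + re.escape(cc_chars) + "]")

def esc(m):
    return "\\u{:04x}".format(ord(m.group(0)))

# U+0085 (NEL) is a Cc character outside the 0-31 range.
result = cc_pattern.sub(esc, "\rMark\t\x85马克\n")
print(result)
```

Scanning the full code point range is a one-time cost of about a million category lookups, so it is worth building the pattern once and reusing it.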
Mark Tolonen