I know there are tons of questions about this, but somehow I could not find a solution to my problem (in Python 3):

toto="//\udcc3\udca0"
fp = open('cool', 'w')
fp.write(toto)

I get:

File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 2: surrogates not allowed

How can I make it work?

To clarify: the string "//\udcc3\udca0" is given to me and I have no control over it. '\udcc3\udca0' is supposed to represent the character 'à'.

wjandrea
Archimondain

1 Answer

'\udcc3\udca0' is supposed to represent the character 'à'

The proper way to write 'à' using Python Unicode escapes is '\u00E0'. Its UTF-8 encoding is b'\xc3\xa0'.

It seems that whatever process produced your string was trying to use the UTF-8 representation, but instead of properly decoding it to a Unicode string, it mapped each non-ASCII byte 0xNN to the lone surrogate U+DCNN. That is Python 3's surrogateescape convention (PEP 383), which is why the bytes 0xC3 0xA0 show up as '\udcc3\udca0'.

>>> 'à'.encode('UTF-8').decode('ASCII', 'surrogateescape')
'\udcc3\udca0'
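
To make the mapping concrete, here is a small sketch (the variable names are only illustrative) showing that each escaped character is just the original byte value offset by 0xDC00, so the raw bytes can be read straight back out of the code points:

```python
# 'à' is written as an escape so the snippet does not depend on the source encoding.
raw = '\u00e0'.encode('utf-8')                    # the two UTF-8 bytes 0xC3 0xA0
escaped = raw.decode('ascii', 'surrogateescape')  # bytes smuggled into U+DCxx

# Subtracting 0xDC00 from each code point gives back the original byte value.
recovered = bytes(ord(ch) - 0xDC00 for ch in escaped)

print([hex(ord(ch)) for ch in escaped])  # ['0xdcc3', '0xdca0']
print(recovered == raw)                  # True
```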

To fix the string, invert the operations that mangled it.

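toto = "//\udcc3\udca0"
# Turn the escaped surrogates back into raw bytes, then decode them as UTF-8.
toto = toto.encode('ASCII', 'surrogateescape').decode('UTF-8')
# At this point, toto == '//à', as intended.
# Name the encoding explicitly so the write does not depend on the locale.
with open('cool', 'w', encoding='utf-8') as fp:
    fp.write(toto)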
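
If you receive many such strings and cannot be sure every one of them is mangled in the same way, a defensive variant may be useful. The helper below is a hypothetical sketch (`repair_surrogates` is my own name, not a standard function). It assumes the escaped bytes were meant to be UTF-8, and it encodes with 'utf-8' rather than 'ASCII' so that strings already containing real non-ASCII characters do not fail. Anything that cannot be repaired is returned unchanged.

```python
def repair_surrogates(text: str) -> str:
    """Undo a surrogateescape decode, assuming the hidden bytes are UTF-8.

    Hypothetical helper: returns the input unchanged if it cannot be repaired.
    """
    try:
        # Surrogates U+DC80..U+DCFF become raw bytes again; normal characters
        # are encoded as usual. The result is then decoded as real UTF-8.
        return text.encode('utf-8', 'surrogateescape').decode('utf-8')
    except UnicodeError:
        # Either a surrogate outside the escape range, or bytes that are not
        # valid UTF-8 after all: leave the string as it was.
        return text


print(repair_surrogates('//\udcc3\udca0') == '//\u00e0')  # True
print(repair_surrogates('plain ascii'))                   # plain ascii
```

Returning the original string on failure is a design choice for this sketch; raising instead may be preferable if silent pass-through would hide a problem.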
dan04
  • Thanks for your answer to a somewhat weird issue. I feel that I should treat the cause rather than the symptom. I'm not sure how I ended up with the wrong '\udcc3\udca0' in the process. I suspect that it comes from the following Erlang call: `:os.cmd(to_charlist "echo à")`. I will investigate... – Archimondain Sep 04 '19 at 22:32
  • I finally solved the issue: Docker was messing up my strings. I needed to add "ENV LANG C.UTF-8" and "ENV LC_ALL C.UTF-8" to the configuration. – Archimondain Sep 05 '19 at 00:40
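
For anyone hitting the same symptom, a quick way to check whether the locale is involved is to ask Python which encodings it picked up from the environment. This is only a diagnostic sketch: `report_encodings` is a hypothetical helper name, and it assumes the mangled string reached Python through a locale-dependent channel such as stdin, command-line arguments, or environment variables.

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the encodings Python derived from the current environment."""
    return {
        # Encoding used by default for text files opened without encoding=...
        'preferred': locale.getpreferredencoding(False),
        # Encoding used for file names, sys.argv and os.environ
        'filesystem': sys.getfilesystemencoding(),
        # Encoding of standard output (may be None if stdout was replaced)
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in report_encodings().items():
    print(f'{name}: {value}')
```

If any of these report something other than UTF-8 inside the container, that would point to the locale settings as the likely cause.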