Sending UTF-8 with sockets

Question

I'm trying to set up a little chat program in Python. Everything was working fine until I sent a string containing a non-ASCII character, which caused the program to crash. The strings are read from a wx.TextCtrl.

How can I send a string with UTF-8 encoding over sockets?
Why does the program work without problems at the start? I have set the encoding to UTF-8, so wouldn't every character cause the program to crash?

Here is the error:

Traceback (most recent call last):
  File "./client.py", line 180, in sendMess
    outSock.sendto(s,self.serveraddr)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 26: 
                    ordinal not in range(128)

Here is how I create the socket and try to send the message:

  outSock = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
  ....
  outSock.sendto(s,self.serveraddr)

http://stackoverflow.com/questions/1644640/how-to-handle-unicode-non-ascii-characters-in-python — Matt Ball, Mar 17 '12 at 18:17
Thanks! So there is no way to send the string without decoding it? — nist, Mar 17 '12 at 18:20
You don't decode to send, you *encode* - you take your unicode strings (which are *not* UTF-8, or at least don't have to be), convert them to bytes, and send those bytes. Also see http://nedbatchelder.com/text/unipain.html for more background information. — , Mar 17 '12 at 18:20
The data you send over the socket is just a stream of bytes; the socket does not know or care what it is. It's up to the receiver to decode the data in a meaningful way. — Some programmer dude, Mar 17 '12 at 18:21

score 8 · Accepted Answer · answered Mar 17 '12 at 18:21

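In Python 2, sendto on a socket takes a "plain" (byte) string, not a unicode object. If you pass a unicode object, Python implicitly encodes it with the ASCII codec. That is why pure-ASCII messages worked and the first non-ASCII character raised UnicodeEncodeError. Therefore you must encode it yourself, say using UTF-8: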

outSock.sendto(s.encode('utf-8'), self.serveraddr)

Similarly, when you recvfrom (or similar) at the other end, you'll need to convert back to a Unicode object:

unicode_string = s.decode('utf-8')
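Putting the two halves together, here is a minimal sketch of a full round trip over a loopback UDP socket. It is written in Python 3 syntax (the encode and decode calls are the same in Python 2), and the function name, buffer size, and timeout are assumptions chosen for illustration, not part of the original program.

```python
import socket

def udp_round_trip(text):
    """Send text as UTF-8 over loopback UDP and return what the receiver decodes."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    receiver.settimeout(2.0)          # avoid blocking forever if the datagram is lost
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Encode the text to bytes before handing it to the socket.
        sender.sendto(text.encode("utf-8"), receiver.getsockname())
        # 4096 is an arbitrary buffer size, large enough for this short message.
        data, _addr = receiver.recvfrom(4096)
        # Decode the received bytes back into text.
        return data.decode("utf-8")
    finally:
        sender.close()
        receiver.close()
```

Calling `udp_round_trip(u"sm\xf6rg\xe5s")` should hand back the same text, since each datagram here carries one complete encoded message.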

(In Python 3, you'll be working with bytes, which makes the need to convert between it and unicode more explicit.)
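As a small illustration of that Python 3 behaviour: passing a str to sendto is rejected with a TypeError before anything is sent, even when the text is pure ASCII. The helper name and the destination address below are hypothetical.

```python
import socket

def str_is_rejected(text):
    """Return True if sendto refuses a str argument (Python 3 behaviour)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Hypothetical destination; a str argument fails before any data is sent.
        sock.sendto(text, ("127.0.0.1", 9))
    except TypeError:
        return True
    finally:
        sock.close()
    return False
```

This differs from Python 2, where an ASCII-only unicode object would be converted silently and only non-ASCII characters triggered the error.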

James Aylett

It is an interesting issue with python 3 because you might get an incomplete unicode char. – arhuaco Sep 03 '14 at 07:38
That's true with python 2 also, though; `s.decode('utf-8')` will explode all over you if you give it a partial UTF-8 sequence. Generally you'd use streams rather than datagrams for this so you know when you've got an entire message (or perhaps you'd implement something similar in datagrams, or constrain message lengths so fragmentation isn't a risk or something). – James Aylett Sep 07 '14 at 12:26
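For the stream case, one possible approach (not something proposed in the comments above) is the standard library's incremental decoder, which holds back an incomplete trailing sequence until the remaining bytes arrive. A minimal sketch in Python 3 syntax, with a hypothetical helper name:

```python
import codecs

def decode_chunks(chunks):
    """Decode an iterable of byte chunks whose boundaries may split a character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # Each call returns only the characters that are complete so far.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if the data ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

# The two bytes of u'\xf6' (0xC3 0xB6) arrive in separate chunks.
text = decode_chunks([b"sm\xc3", b"\xb6r"])
```

Decoding the first chunk alone with `b"sm\xc3".decode("utf-8")` raises UnicodeDecodeError, which is the failure described above. This only handles split characters. Knowing where one message ends and the next begins still needs framing, such as a length prefix or a delimiter.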
