1

I have an app that submits a request to a Python web server. The app has a UTF-8 string with the following contents:

la langue franþaise.ppt

This is put into an HTTP header, and somehow gets converted to this:

la langue fran\xfeaise.ppt

Then Python on the web server tries to do something with the string that presumably expects it to be UTF-8, and I get this error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 14: invalid start byte

I would basically like to preserve this UTF-8 string from the app to the web server, such that the variable would contain the following if I printed it:

la langue franþaise.ppt

What's the best way to preserve a UTF-8 string between a web client and a server (assuming both are written in Python)?

aaa90210
  • Without any more information (web server you are using, etc.) I can't give an exact answer, but one quick workaround would be to encode the string in base64 – fileoffset Aug 01 '14 at 06:42
  • @fileoffset It is a Django app, sometimes running under mod_wsgi, sometimes under FCGI, sometimes using the built-in app server. I might try the base64 thing, but I was hoping there would be a "just works" sort of string escaping that Python would understand. One of the problems with Base64 is that it becomes useless when quickly examining server logs to see what headers were passed in. – aaa90210 Aug 01 '14 at 09:13

4 Answers

2

\xfe is ISO-8859-1 encoding for þ.

While UTF-8 in content is widely supported, HTTP headers should be ASCII. The HTTP spec allows ISO-8859-1, but it is not recommended and tooling does not handle it reliably. Other encodings are not allowed without special escaping.

If possible, escape your special characters in a way that allows them to be transferred as ASCII. Base64, as suggested by fileoffset, is one option; another is the quote function from urllib.parse (or urllib on Python 2).
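
As a minimal sketch of the quote approach on Python 3 (the variable names here are illustrative, not from the question), the client could percent-encode the value before putting it in the header, and the server could reverse it:

```python
from urllib.parse import quote, unquote

filename = "la langue franþaise.ppt"

# Client side: quote() encodes the text as UTF-8 by default, then
# percent-escapes every byte outside its safe set, so the result is ASCII.
header_value = quote(filename)

# Server side: unquote() undoes the escaping and decodes the bytes as
# UTF-8 by default, giving back the original text.
restored = unquote(header_value)

print(header_value)  # la%20langue%20fran%C3%BEaise.ppt
print(restored)      # la langue franþaise.ppt
```

The escaped form stays mostly readable in server logs, since plain ASCII letters pass through unchanged.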

Jason S
2

HTTP headers are strictly 7-bit US-ASCII. The RFC allows you to accept ISO-8859-1 as a compatibility hack, but don't send any byte above 127.

There is no standard or best way to send any data type other than ASCII in the headers. It is your application's responsibility to encode arbitrary sequences of bytes (and your UTF-8 string is an arbitrary sequence of bytes) such that the encoding is 7-bit safe.

Use whatever is most convenient for both client and server in their implementation language(s): Base64 encoding, \hh byte escapes, \uhhhh Unicode character escapes, %hh as per URL encoding, =HH as in MIME, or &#... entities. All of these methods exist and are used in the wild.
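
To illustrate two of those options in Python 3, here is a rough sketch (variable names are made up for the example) that round-trips the filename through Base64 and through backslash escapes using the standard unicode_escape codec:

```python
import base64

filename = "la langue franþaise.ppt"

# Option 1: Base64. Safe for any byte sequence, but opaque in logs.
b64_value = base64.b64encode(filename.encode("utf-8")).decode("ascii")
from_b64 = base64.b64decode(b64_value).decode("utf-8")

# Option 2: backslash escapes. ASCII characters are left alone, so the
# value stays readable; non-ASCII characters become \xhh or \uhhhh.
escaped_value = filename.encode("unicode_escape").decode("ascii")
from_escaped = escaped_value.encode("ascii").decode("unicode_escape")

print(escaped_value)  # la langue fran\xfeaise.ppt
print(from_escaped)   # la langue franþaise.ppt
```

Whichever scheme is picked, both ends have to agree on it, since nothing in the header itself says how the value was escaped.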

n. m. could be an AI
0

Try decoding your string with the 'iso-8859-1' codec. For more details check here

Community
ashwinjv
0

You have a byte string (it's already encoded).

To print it, you first need to decode it so that the byte \xfe can be translated into its character equivalent.

In order to know what \xfe should be, you need to tell Python which encoding the bytes are in. You also need to make sure that wherever you are printing it (for example, a terminal) the font can handle the character; otherwise you'll get garbage output.

If everything works correctly, you'll get the following:

>>> i = "la langue fran\xfeaise.ppt"
>>> print(i.decode('iso-8859-1'))
la langue franþaise.ppt

Note that your string is not UTF-8 encoded, so if you try to decode it as UTF-8, you'll get this familiar error:

>>> print(i.decode('utf-8'))
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 14:
invalid start byte

To convert it, you first have to decode it from its original character set, then re-encode it as utf-8:

>>> z = i.decode('iso-8859-1').encode('utf-8')
>>> z
'la langue fran\xc3\xbeaise.ppt'
>>> i
'la langue fran\xfeaise.ppt'

Notice the differences in the bytes that represent the same character. In the end, when you print it, it will print correctly (assuming again, your terminal font can handle the glyphs):

>>> print(z.decode('utf-8'))
la langue franþaise.ppt
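
The session above looks like Python 2, where a plain string literal is a byte string. As a sketch of the same round trip on Python 3, assuming the header value arrives as a bytes object (here called raw, a name chosen for this example):

```python
# The bytes as they arrived: \xfe is the ISO-8859-1 encoding of þ.
raw = b"la langue fran\xfeaise.ppt"

# Decode from the encoding the bytes are actually in.
text = raw.decode("iso-8859-1")

# Re-encode as UTF-8: the single byte \xfe becomes the pair \xc3\xbe.
utf8_bytes = text.encode("utf-8")

# Decoding the original bytes as UTF-8 fails, as in the question.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the offending byte

print(text)            # la langue franþaise.ppt
print(utf8_bytes)      # b'la langue fran\xc3\xbeaise.ppt'
print(error_position)  # 14
```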
Burhan Khalid