How to convert a string to utf-8 in Python

Question

I have a browser which sends UTF-8 characters to my Python server, but when I retrieve them from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to UTF-8?

NOTE: The string passed from the web is already UTF-8 encoded; I just want to make Python treat it as UTF-8, not ASCII.

Try this link [http://evanjones.ca/python-utf8.html](http://evanjones.ca/python-utf8.html) — Mudassir, Nov 15 '10 at 08:33
I think a better title would be **How to coerce a string to unicode without translation?** — boatcoder, Aug 11 '16 at 22:05
In 2018, python 3 if you get ascii decode error do `"some_string".encode('utf-8').decode('utf-8')` — devssh, Sep 26 '18 at 08:40

score 312 · Accepted Answer · edited Aug 31 '20 at 01:00

In Python 2

>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)

^ This is the difference between a byte string (plain_string) and a unicode string (unicode_string).

>>> s = "Hello!"
>>> u = unicode(s, "utf-8")

^ Converting to unicode and specifying the encoding.

In Python 3

All strings are unicode. The unicode function does not exist anymore. See the comment from @Noumenon.
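In Python 3 the same distinction exists under different names: `bytes` is the byte string and `str` is the Unicode string. A minimal sketch of the equivalent conversion, assuming the data arrives as UTF-8 encoded `bytes` (the variable names are illustrative):

```python
# Python 3: bytes is the byte string type, str is the Unicode string type.
raw = b"Caf\xc3\xa9"          # UTF-8 encoded bytes for "Café"
text = raw.decode("utf-8")    # bytes -> str

print(type(raw).__name__, type(text).__name__)  # bytes str
print(len(raw), len(text))                      # 5 4
```

The lengths differ because "é" is one character but two bytes in UTF-8.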

edited Aug 31 '20 at 01:00 by Maxime · answered Nov 15 '10 at 08:31 by user225312

I am getting the following error: `UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 2: invalid start byte`. This is my code:

ret = []
for line in csvReader:
    cline = []
    for elm in line:
        unicodestr = unicode(elm, 'utf-8')
        cline.append(unicodestr)
    ret.append(cline)

– Gopakumar N G Oct 22 '13 at 06:56

None of this applies in Python 3: all strings are unicode and `unicode()` doesn't exist. – Noumenon Aug 28 '15 at 12:00
Kind of bumping this, but thanks. This fixed an issue where I was trying to print unicode and was getting �s. – 智障的人 Feb 07 '16 at 17:53
How do you convert `u` back to a `str` (that is, convert `u` back to `s`)? – Tanguy Aug 25 '17 at 13:25
This code will only work as long as the text does not contain non-ASCII characters; a single accented character in the string will make it fail. – Haroldo_OK Feb 16 '18 at 10:31
Hi, if you have `"2340"` in a string variable, and you want to print the unicode character `U+2340` (⍀), is there any way to do that? – Sha2b Nov 05 '19 at 03:36
@Sha2b `chr(0x2340)` gives: `⍀` – U13-Forward Oct 25 '21 at 04:49
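When the code point is held in a string variable rather than written as a literal, it has to be parsed as base 16 first. A small Python 3 sketch (the variable names are illustrative):

```python
hex_digits = "2340"                 # code point written as hexadecimal text
code_point = int(hex_digits, 16)    # parse as base 16 -> 9024
char = chr(code_point)              # the character U+2340

# The reverse direction: character -> zero-padded hexadecimal code point.
round_trip = format(ord(char), "04X")   # "2340"
```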

score 83 · Answer 2 · answered Oct 07 '13 at 17:00

If the methods above don't work, you can also tell Python 2 to skip the parts of a byte string that it can't decode as UTF-8:

stringnamehere.decode('utf-8', 'ignore')
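In Python 3 the same error handlers are available on `bytes.decode`. The sketch below uses a made-up input containing one invalid byte and compares the default strict behaviour with `'ignore'` and `'replace'`:

```python
data = b"25\xb0C"   # 0xb0 is not valid UTF-8 on its own (it is "°" in Latin-1)

try:
    data.decode("utf-8")                     # default errors="strict"
except UnicodeDecodeError as exc:
    strict_error = exc.reason                # "invalid start byte"

ignored = data.decode("utf-8", "ignore")     # "25C": the bad byte is dropped
replaced = data.decode("utf-8", "replace")   # "25\ufffdC": bad byte becomes U+FFFD
```

Both handlers silently lose data, so if the input is really in another encoding (Latin-1 or cp1252, for instance), decoding with that encoding is usually the better fix.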

answered Oct 07 '13 at 17:00 by duhaime

Got `AttributeError: 'str' object has no attribute 'decode'` – saran3h Aug 06 '18 at 14:06
@saran3h It sounds like you're using Python 3, in which case Python *should* handle encoding issues for you. Have you tried reading your document without specifying an encoding? – duhaime Aug 06 '18 at 14:56
By default Python picks the system encoding. On Windows 10 it's cp1252, which is different from UTF-8. I wasted a few hours on this while using `codecs.open()` in Python 3.8. – Vishesh Mangla Jul 01 '20 at 15:15
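One way to avoid depending on the platform default is to pass `encoding=` explicitly whenever a text file is opened. A minimal Python 3 sketch, using a throwaway temporary file (the file name and variable names are made up for the example):

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

text = "Ribeirão Preto"
path = os.path.join(tempfile.mkdtemp(), "city.txt")   # throwaway demo file

# Pass the encoding explicitly on both write and read.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

# The bytes on disk are UTF-8 regardless of the platform default.
with open(path, "rb") as f:
    on_disk = f.read()    # b"Ribeir\xc3\xa3o Preto"
```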

score 24 · Answer 3 · edited May 26 '21 at 00:17

This might be a bit overkill, but when I work with ASCII and unicode in the same files, repeating decode calls can be a pain. This is what I use:

def make_unicode(inp):
    # Python 2: decode byte strings as UTF-8; unicode objects pass through unchanged.
    if not isinstance(inp, unicode):
        inp = inp.decode('utf-8')
    return inp

edited May 26 '21 at 00:17 by Buddy Bob · answered Nov 29 '14 at 19:13 by Blueswannabe

This no longer works, as written... the `unicode` type doesn't exist in python3 – Mike Pennington Dec 26 '21 at 15:14
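A possible Python 3 port of the same idea is sketched below. The name `make_text` and the `encoding` parameter are hypothetical, not part of the original answer; the check is against `bytes` because that is the type that still needs decoding in Python 3:

```python
def make_text(inp, encoding="utf-8"):
    """Return inp as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(inp, (bytes, bytearray)):
        return inp.decode(encoding)
    return inp


print(make_text(b"abc"))   # abc
print(make_text("abc"))    # abc
```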

score 16 · Answer 4 · edited Apr 25 '15 at 05:17

Adding the following line to the top of your .py file:

# -*- coding: utf-8 -*-

lets you write non-ASCII string literals directly in your script, like this:

utfstr = "ボールト"

edited Apr 25 '15 at 05:17 by famousgarkin · answered May 22 '14 at 15:15 by Ken

This is not what the OP asks. But avoid such string literals anyway: it creates a Unicode string in Python 3 (good) but a bytestring in Python 2 (bad). Either add `from __future__ import unicode_literals` at the top or use the `u''` prefix. Don't use non-ASCII characters in `bytes` literals. To get UTF-8 bytes, you can call `utf8bytes = unicode_text.encode('utf-8')` later if necessary. – jfs Apr 26 '15 at 01:26
@jfs How will `from __future__ import unicode_literals` help me to convert a string with non-ASCII characters to UTF-8? – Ortal Turgeman Nov 29 '18 at 17:30
@OrtalTurgeman I'm not answering the question. Look, it is a comment, not an answer. My comment addresses the issue with the code in the answer. It tries to create a bytestring with non-ASCII characters on Python 2 (it is a SyntaxError on Python 3, where bytes literals forbid that). – jfs Nov 29 '18 at 17:34
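The points made in these comments can be checked in Python 3 with a short sketch. The `compile()` call is only a way to show the SyntaxError without breaking the script itself:

```python
text = "ボールト"                         # Python 3: a str (Unicode) literal
utf8_bytes = text.encode("utf-8")        # explicit conversion to UTF-8 bytes

# 4 characters, 3 bytes each in UTF-8.
lengths = (len(text), len(utf8_bytes))   # (4, 12)

# A bytes literal may only contain ASCII characters; compiling one that
# does not raises SyntaxError.
try:
    compile('b"ボールト"', "<example>", "eval")
    bytes_literal_ok = True
except SyntaxError:
    bytes_literal_ok = False             # this branch is taken
```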

score 13 · Answer 5 · answered Nov 15 '10 at 08:55

If I understand you correctly, you have a utf-8 encoded byte-string in your code.

Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).

You do that by using the unicode function (Python 2 only) or the decode method. Either:

unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")

Or:

unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")
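In Python 3 the constructor form is spelled `str(bytestr, encoding)`. The sketch below shows both forms, plus a common pitfall when the encoding argument is left out (the sample bytes are made up for the example):

```python
bytestr = b"Hello, \xe4\xb8\x96\xe7\x95\x8c"   # UTF-8 bytes for "Hello, 世界"

decoded_a = str(bytestr, "utf-8")     # constructor form, like unicode(bytestr, "utf-8")
decoded_b = bytestr.decode("utf-8")   # method form

# Pitfall: without an encoding, str() returns the repr of the bytes object
# (a string starting with b'Hello), not the decoded text.
not_decoded = str(bytestr)

# Encoding goes the other way: str -> bytes.
encoded = decoded_a.encode("utf-8")
```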

score 13 · Answer 6 · answered Jul 26 '17 at 20:31

# Python 2: the byte string holds UTF-8 bytes, so decode it as UTF-8.
city = 'Ribeir\xc3\xa3o Preto'
print city.decode('utf-8')
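For comparison, here is a Python 3 sketch using the same bytes. It shows what each codec produces and how text that was already mis-decoded as cp1252 can be recovered, assuming cp1252 really was the codec used by mistake:

```python
raw = b"Ribeir\xc3\xa3o Preto"        # the same UTF-8 bytes as a Python 3 bytes object

city = raw.decode("utf-8")            # correct: "Ribeirão Preto"
garbled = raw.decode("cp1252")        # wrong codec: "RibeirÃ£o Preto"

# If text was already mis-decoded as cp1252, reversing that step and
# decoding as UTF-8 recovers the original.
repaired = garbled.encode("cp1252").decode("utf-8")
```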

answered Jul 26 '17 at 20:31 by Willem

score 12 · Answer 7 · edited Feb 20 '19 at 10:23

In Python 3.6 there is no built-in unicode() function. Strings are already stored as Unicode by default, so no conversion is required. Example:

my_str = "\u221a25"
print(my_str)  # prints: √25

edited Feb 20 '19 at 10:23 by Pradeep R · answered Apr 20 '17 at 15:53 by Zld Productions

score 5 · Answer 8 · answered Nov 09 '17 at 17:24

Translate with ord() and unichr() (Python 2). Every Unicode character has an associated number, something like an index, and Python has a few functions to translate between a character and its number. Below is an example with ñ. Hope it helps.

>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
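In Python 3 the same round trip uses `chr()` in place of `unichr()`, and the character literal is already a Unicode string. A minimal sketch:

```python
char = "ñ"
code_point = ord(char)             # 241 (0xF1)
same_char = chr(code_point)        # "ñ"; chr() replaces Python 2's unichr()
utf8 = char.encode("utf-8")        # b"\xc3\xb1"
```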

score 4 · Answer 9 · answered Sep 01 '22 at 10:20

The URL is percent-encoded as ASCII, so to the Python server it is just a plain string, e.g. "T%C3%A9st%C3%A3o".

Python sees "é" and "ã" as the literal sequences %C3%A9 and %C3%A3.

You can decode a URL like this:

import urllib.parse
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))  # prints: Téstão

See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.
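Since the question is about a query string, `urllib.parse.parse_qs` may be more convenient than calling `unquote` on each value, because it splits the parameters and percent-decodes them in one step. A sketch with a made-up query string:

```python
from urllib.parse import parse_qs, quote, unquote

query = "name=T%C3%A9st%C3%A3o&city=Ribeir%C3%A3o+Preto"   # made-up query string

params = parse_qs(query, encoding="utf-8")
# {'name': ['Téstão'], 'city': ['Ribeirão Preto']}

# quote() is the inverse of unquote(): it percent-encodes the UTF-8 bytes.
encoded = quote("Téstão")          # "T%C3%A9st%C3%A3o"
decoded = unquote(encoded)         # "Téstão"
```

Each value in `params` is a list because a key may appear more than once in a query string.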

shioko · Answer 10 · 2020-10-06T09:23:22.870

First, str in Python is a sequence of Unicode characters.
Second, UTF-8 is an encoding standard for encoding a Unicode string to bytes. There are many encoding standards out there (e.g. UTF-16, ASCII, Shift-JIS).

When the client sends data to your server using UTF-8, it is sending a bunch of bytes, not a str.

You received a str because the "library" or "framework" that you are using has implicitly converted those bytes to str.

Under the hood, there is just a bunch of bytes. You just need to ask the "library" to give you the request content as bytes, and then handle the decoding yourself (if the library can't give you the bytes, it is trying to do black magic and you shouldn't use it).

Decode UTF-8 encoded bytes to str: bs.decode('utf-8')
Encode str to UTF-8 bytes: s.encode('utf-8')
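To make the difference between a str and its encoded bytes concrete, the sketch below encodes one character with several of the encodings mentioned above. The byte values in the comments are for that single character only:

```python
text = "é"                               # one Unicode code point, U+00E9

utf8 = text.encode("utf-8")              # b"\xc3\xa9"  (2 bytes)
utf16 = text.encode("utf-16-le")         # b"\xe9\x00"  (2 bytes, different values)
latin1 = text.encode("latin-1")          # b"\xe9"      (1 byte)

try:
    text.encode("ascii")                 # ASCII has no "é"
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                     # this branch is taken

# The same str comes back only when decoding with the matching encoding.
round_trip = utf8.decode("utf-8")        # "é"
```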

score 1 · Answer 11 · answered Sep 20 '21 at 22:26

You can use the codecs module from Python's standard library.

import codecs
codecs.decode(b'Decode me', 'utf-8')
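`codecs` is also useful when the bytes arrive in pieces (for example from a socket), because a multi-byte character can be split across two chunks and decoding each chunk separately would fail. A sketch with made-up chunks, using an incremental decoder:

```python
import codecs

# "café" in UTF-8, split in the middle of the two-byte "é".
chunks = [b"caf\xc3", b"\xa9"]

decoder = codecs.getincrementaldecoder("utf-8")()   # errors="strict" by default
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))       # flush; raises if bytes are left over

text = "".join(parts)    # "café"
```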

answered Sep 20 '21 at 22:26 by haccks

score 0 · Answer 12 · answered Jul 19 '21 at 16:25

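You can also do this, using the third-party Unidecode package (`pip install Unidecode`), which transliterates Unicode text to plain ASCII: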

from unidecode import unidecode
unidecode(yourStringtoDecode)

answered Jul 19 '21 at 16:25 by Kevin

What is `unidecode`? Is it this https://pypi.org/project/Unidecode? Please provide info if it's a 3rd-party package, and how to install/use it. – Gino Mempin Jul 19 '21 at 23:27
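If installing a third-party package is not an option, a rough standard-library approximation is to decompose accented characters and drop whatever is not ASCII. This is a different technique from Unidecode, not its implementation: it only strips accents and discards everything else. The helper name `strip_accents` is hypothetical:

```python
import unicodedata


def strip_accents(text):
    """Rough ASCII fallback (hypothetical helper, not Unidecode itself)."""
    decomposed = unicodedata.normalize("NFKD", text)   # "é" -> "e" + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Téstão"))          # Testao
print(repr(strip_accents("ボールト")))   # '' (nothing maps to ASCII)
```

Either approach loses information, so it answers a different need than decoding UTF-8 bytes to text.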

score -1 · Answer 13 · edited Apr 26 '20 at 11:44

Yes, you can add

# -*- coding: utf-8 -*-

as the first line of your source file (PEP 263 also allows the second line).

You can read more details here: https://www.python.org/dev/peps/pep-0263/

edited Apr 26 '20 at 11:44 by David Buck · answered Apr 26 '20 at 11:05 by David-Star
