How do I check if a string is unicode or ascii?

Question

What do I have to do in Python to figure out which encoding a string has?

@Johnsyweb Because of `{UnicodeDecodeError} 'ascii' codec can't decode byte 0xc2` — alex, Oct 10 '18 at 15:21

score 322 · Accepted Answer · edited Oct 12 '18 at 17:37


In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
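For Python 3, the same kind of type check might look like the sketch below. The name `describe_string_type` is hypothetical (it is not from the answer), and the function returns a label instead of printing it so that it is easy to test.

```python
def describe_string_type(s):
    """Return a label for the Python 3 type of s (hypothetical helper)."""
    if isinstance(s, str):
        return "unicode string"    # str holds Unicode text in Python 3
    elif isinstance(s, (bytes, bytearray)):
        return "byte string"       # raw bytes; the encoding is not recorded
    else:
        return "not a string"


print(describe_string_type("abc"))    # unicode string
print(describe_string_type(b"abc"))   # byte string
print(describe_string_type(42))       # not a string
```

As with the Python 2 version, this only reports the type; it says nothing about whether the content happens to be ASCII.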

edited Oct 12 '18 at 17:37

user2357112


answered Feb 13 '11 at 22:40

Greg Hewgill


3

@ProsperousHeart: You're probably using Python 3. – Greg Hewgill Apr 10 '19 at 04:22
Note: *first*, you need to confirm you're running Python2. If your code is designed to run under either Python2 or Python3, you'll need to check your Python version first. – Edward Falk Sep 03 '21 at 21:58

Mikel · Answer 2 · 2019-04-12T06:33:49.483

135

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.

How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
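The decode check can be wrapped in a small function. This is a minimal sketch for Python 3; the name `is_valid_encoding` is made up for illustration.

```python
def is_valid_encoding(data, encoding):
    """Return True if the bytes in data decode cleanly with encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b'\xc3\x9c'                        # UTF-8 bytes for 'Ü'
print(is_valid_encoding(u_umlaut, 'utf-8'))   # True
print(is_valid_encoding(u_umlaut, 'ascii'))   # False
print(is_valid_encoding(b'abc', 'ascii'))     # True
```

A True result only means the bytes are valid in that encoding, not that they were actually written in it.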

edited Apr 12 '19 at 06:33

answered Feb 13 '11 at 22:33

Mikel


1

Just for other people's reference - str.decode does not exist in python 3. Looks like you have to `unicode(s, "ascii")` or something – Shadow Aug 05 '16 at 05:50
4

Sorry, I meant ``str(s, "ascii")`` – Shadow Nov 09 '16 at 22:11
1

This is not accurate for python 3 – ProsperousHeart Apr 10 '19 at 04:22
2

@ProsperousHeart Updated to cover Python 3. And to try to explain the difference between bytestrings and unicode strings. – Mikel Apr 12 '19 at 06:35
decode() method's default is 'utf-8'. So, if you call this method over a class 'bytes', you would get a 'OK' with `print("utf8 content:", html.decode())`, for example. – RicHincapie Aug 24 '20 at 19:54

ThinkBonobo · Answer 3 · 2016-06-23T13:01:22.167

48

In Python 3.x all strings are sequences of Unicode characters, and doing the isinstance check for str (which means unicode string by default) should suffice.

isinstance(x, str)

With regard to Python 2.x, most people seem to be using an if statement that has two checks: one for str and one for unicode.

If you want to check if you have a 'string-like' object all with one statement though, you can do the following:

isinstance(x, basestring)

edited Jun 23 '16 at 13:01

answered Sep 09 '13 at 20:24

ThinkBonobo


This is false. In Python 2.7 `isinstance(u"x",basestring)` returns `True`. – PythonNut Jan 24 '14 at 02:04
11

@PythonNut: I believe that was the point. The use of isinstance(x, basestring) suffices to replace the distinct dual tests above. – KQ. Mar 24 '14 at 04:28
No, but `isinstance(x, basestring)` is True for both unicode and regular strings, making the test useless. – PythonNut Mar 24 '14 at 15:21
5

It's useful in many cases, but evidently not what the questioner meant. – mhsmith Feb 19 '15 at 16:32
3

This is the answer to the question. All others misunderstood what OP said and gave generic answers about type checking in Python. – fiatjaf Apr 11 '15 at 16:20
1

Doesn't answer OP's question. The title of the question (alone) COULD be interpreted such that this answer is correct. However, OP specifically says "figure out which" in the question's description, and this answer does not address that. – MD004 Jan 22 '18 at 22:42

score 35 · Answer 4 · answered May 21 '12 at 14:12


Unicode is not an encoding - to quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

...then Unicode is "text-ness";

it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008. It explains things a lot better than most of the related answers on Stack Overflow.

answered May 21 '12 at 14:12

Alex Dean


1

Those slides are probably the best introduction to Unicode I've come across to date – Jonny Oct 18 '18 at 11:15

score 25 · Answer 5 · answered Aug 14 '12 at 12:33


If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s,bytes) or isinstance(s,unicode) without wrapping them in either try/except or a python version test, because unicode is undefined in Python 3, and bytes is undefined before Python 2.6 (in Python 2.6 and 2.7 it is only an alias for str, so it does not mean the same thing as in Python 3).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here's an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3,0,0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there's probably a better way.
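One possible alternative, sketched here without claiming it is what the answer had in mind, is to resolve the type names once with try/except NameError and then use ordinary isinstance checks. The names `text_type`, `binary_type` and `to_native_str` are chosen for illustration only.

```python
try:
    text_type = unicode        # Python 2: the unicode builtin exists
    binary_type = str
except NameError:
    text_type = str            # Python 3: str is the text type
    binary_type = bytes


def to_native_str(s, encoding='ascii'):
    """Convert s to the interpreter's native str type (hypothetical helper)."""
    if isinstance(s, str):
        return s                       # already the native string type
    if isinstance(s, binary_type):
        return s.decode(encoding)      # Python 3 only: bytes -> str
    if isinstance(s, text_type):
        return s.encode(encoding)      # Python 2 only: unicode -> str
    raise TypeError("expected a string, got %r" % type(s))


print(to_native_str(b'abc'))   # abc (on Python 3)
print(to_native_str('abc'))    # abc
```

The version-dependent part is confined to the first few lines, so the rest of the code never has to mention the Python version.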

answered Aug 14 '12 at 12:33

Dave Burton


7

The better way is probably to use `six`, and test against `six.binary_type` and `six.text_type` – Ian Clelland Sep 26 '12 at 19:43
1

You can use *type(s).__name__* to probe type names. – Paulo Freitas Aug 26 '13 at 16:19
I am not quite sure of the use case for that bit of code, unless there is a logic error. I think there should be a "not" in the python 2 code. Otherwise you are converting everything to unicode strings for Python 3 and the opposite for Python 2! – oligofren Jun 22 '14 at 09:56
Yes, oligofren, that's what it does. The standard internal strings are Unicode in Python 3 and ASCII in Python 2. So the code snippets convert text to standard internal string type (be it Unicode or ASCII). – Dave Burton Aug 23 '16 at 15:25

score 13 · Answer 6 · edited Apr 10 '19 at 05:57


Use:

import six
if isinstance(obj, six.text_type):
    print("text string")

Inside the six library, the relevant names are defined as:

if PY3:
    string_types = str,
    text_type = str
else:
    string_types = basestring,
    text_type = unicode

edited Apr 10 '19 at 05:57

ProsperousHeart


answered Aug 08 '16 at 08:50

madjardi


2

it should be `if isinstance(obj, six.text_type) `. But yes this is imo the correct answer. – karantan Aug 29 '16 at 07:50
Doesn't answer OP's question. The title of the question (alone) COULD be interpreted such that this answer is correct. However, OP specifically says "figure out which" in the question's description, and this answer does not address that. – MD004 Jan 22 '18 at 22:44

score 4 · Answer 7 · answered Jul 09 '14 at 02:35

Note that on Python 3, it's not really fair to say any of:

strs are UTFx for any x (eg. UTF8)
strs are Unicode
strs are ordered collections of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, it's not as simple to answer this question as you might imagine.

An obvious way to test for ASCII-compatible strings is by an attempted encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same approach, catching the error, distinguishes them.
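Both checks can be combined into one helper. This is only a sketch for Python 3.7 or later, where `str.isascii()` is available; the function name `classify_text` and its three labels are invented here.

```python
def classify_text(s):
    """Classify a Python 3 str by how it can be encoded (hypothetical helper)."""
    if s.isascii():                 # str.isascii() requires Python 3.7+
        return "ascii"
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return "not encodable"      # e.g. the string contains a lone surrogate
    return "non-ascii"


print(classify_text("Hello there!"))        # ascii
print(classify_text("Hello there... ☃!"))   # non-ascii
print(classify_text("\udcc3"))              # not encodable
```

On older Python 3 versions the `isascii()` call could be replaced by the attempted `encode("ascii")` shown in the answer.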

score 3 · Answer 8 · edited May 13 '18 at 10:10

This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python-version agnostic, without a version test or importing six. Please comment with improvements to the sample code below to help other people.

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8
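If only Python 3 has to be supported, the same idea can be written more directly. The following is a rough sketch under that assumption; `to_utf8_bytes` is a hypothetical name, and bytes input is passed through without checking that it really is UTF-8.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (Python 3 only, hypothetical helper)."""
    if isinstance(value, bytes):
        return value                    # assumed to be UTF-8 already; not verified
    if isinstance(value, str):
        return value.encode("utf-8")
    return str(value).encode("utf-8")   # numbers and other objects via str()


print(to_utf8_bytes("\xdc"))   # b'\xc3\x9c'
print(to_utf8_bytes(3.5))      # b'3.5'
```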

You my friend deserve to be the correct response! I am using python 3 and I was still having problems until I found this treasure! — Mansour.M, Nov 15 '19 at 15:40

score 2 · Answer 9 · edited May 12 '14 at 14:08


You could use Universal Encoding Detector, but be aware that it will only give you a best guess, not the actual encoding, because it's impossible to know the encoding of a string such as "abc". You will need to get the encoding information elsewhere; for example, HTTP uses the Content-Type header for that.
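To see why only a guess is possible, the sketch below tries a handful of encodings and reports every one that decodes the bytes without error. It uses only the standard library, not Universal Encoding Detector; the function name `candidate_encodings` and the default list of encodings are arbitrary choices made for this illustration.

```python
def candidate_encodings(data, encodings=("ascii", "utf-8", "latin-1", "cp1252")):
    """Return the encodings from the list that can decode data without error."""
    matches = []
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue                 # these bytes are not valid in this encoding
        matches.append(encoding)
    return matches


print(candidate_encodings(b"abc"))        # ['ascii', 'utf-8', 'latin-1', 'cp1252']
print(candidate_encodings(b"\xc3\x9c"))   # ['utf-8', 'latin-1', 'cp1252']
```

Several encodings accept the same bytes, so a successful decode cannot prove which encoding the author used.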

edited May 12 '14 at 14:08

Tom Morris


answered Feb 13 '11 at 22:34

Seb


score 2 · Answer 10 · answered Apr 07 '21 at 16:05

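In Python 3, I had to work out whether a byte string is like b'\x7f\x00\x00\x01' or b'127.0.0.1'. My solution is this:

def get_str(value):
    # value is expected to be a bytes object
    try:
        str_value = value.decode('ascii')
    except UnicodeDecodeError:
        str_value = ''

    if str_value and str_value.isprintable():
        return str_value

    # not printable ASCII: render each byte as a decimal number
    return '.'.join('%d' % x for x in value)

Worked for me; I hope it works for anyone else who needs it.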

score 0 · Answer 11 · answered May 28 '18 at 11:56


For py2/py3 compatibility simply use

import six
if isinstance(obj, six.text_type):
    print("text string")

answered May 28 '18 at 11:56

Vishvajit Pathak


score 0 · Answer 12 · answered Sep 18 '19 at 14:24

One simple approach is to check whether unicode is a builtin. If it is, you're in Python 2 and a plain string literal is a byte string (str). To ensure everything is unicode, one can do:

try:
    import builtins                   # Python 3
except ImportError:
    import __builtin__ as builtins    # Python 2

i = 'cats'
if 'unicode' in dir(builtins):     # True in python 2, False in 3
    i = unicode(i)
