258

I want to check whether a string is in ASCII or not.

I am aware of ord(), but when I try ord('é') I get TypeError: ord() expected a character, but string of length 2 found. I understood that this is caused by the way I built Python (as explained in ord()'s documentation).

Is there another way to check?

Amir
Nico

16 Answers

284

I think you are not asking the right question--

A string in python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.

Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not an ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"
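
For readers landing here on Python 3, where `str` has no `decode` method, the same try/except idea can be sketched so it handles both `bytes` and `str` (a hypothetical helper, not part of the original answer):

```python
def is_ascii_encoded(data):
    """Return True if data (bytes or str) contains only ASCII."""
    try:
        if isinstance(data, bytes):
            data.decode('ascii')   # bytes: try decoding
        else:
            data.encode('ascii')   # str: try encoding
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True
```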
Vincent Marchetti
  • 38
    Using `encode` is better, because strings have no `decode` method in Python 3; see [what's the difference between encode/decode? (python 2.x)](http://stackoverflow.com/questions/447107/whats-the-difference-between-encode-decode-python-2-x/449281#449281) – Jet Guo May 08 '11 at 15:21
  • 1
    @Sri: That is because you are using it on an unencoded string (`str` in Python 2, `bytes` in Python 3). – dotancohen Dec 29 '13 at 07:15
  • In Python 2, this solution only works for a _unicode_ string. A `str` in any ISO encoding would need to be encoded to Unicode first. The answer should go into this. – alexis Oct 07 '14 at 09:49
  • @JetGuo: you should use both depending on the input type: `s.decode('ascii') if isinstance(s, bytes) else s.encode('ascii')` in Python 3. OP's input is a bytestring `'é'` (Python 2 syntax, Python 3 hadn't been released at the time) and therefore `.decode()` is correct. – jfs Sep 04 '15 at 10:36
  • @Sri: you have passed a Unicode string instead of a bytestring. Python 2 tries to encode implicitly using `sys.getdefaultencoding()` (`'ascii'`). If your input is Unicode; use `.encode('ascii', 'strict')` explicitly instead. – jfs Sep 04 '15 at 10:36
  • 2
    @alexis: wrong. `str` on Python 2 is a bytestring. It is correct to use `.decode('ascii')` to find out whether all bytes are in the ascii range. – jfs Sep 04 '15 at 10:36
  • @JetGuo OP's question was about Python 2. Using `mystring.encode('ascii')` here will actually first try to decode the string using the system default to create a unicode string, and then re-encode that string as the specified encoding, which is a bad way to do it. – augurar Jan 25 '17 at 05:16
231
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
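
For example (restating the function so the snippet is self-contained):

```python
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

print(is_ascii("Python"))  # True: every code point is below 128
print(is_ascii("é"))       # False: U+00E9 is 233
print(is_ascii(""))        # True: all() on an empty iterable is True
```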
Alexander Kojevnikov
  • 109
    Pointlessly inefficient. Much better to try s.decode('ascii') and catch UnicodeDecodeError, as suggested by Vincent Marchetti. – ddaa Oct 13 '08 at 18:48
  • 27
    It's not inefficient. all() will short-circuit and return False as soon as it encounters an invalid byte. – John Millikin Oct 13 '08 at 20:03
  • 11
    Inefficient or not, the more pythonic method is the try/except. – Jeremy Cantrell Oct 14 '08 at 14:38
  • 46
    It is inefficient compared to the try/except. Here the loop is in the interpreter. With the try/except form, the loop is in the C codec implementation called by str.decode('ascii'). And I agree, the try/except form is more pythonic too. – ddaa Oct 16 '08 at 17:55
  • 3
    -1 Not only is the loop over Python code instead of C code, but also there's a Python function call `ord(c)` -- UGLY -- at the very least use `c <= "\x7F"` instead. – John Machin Jul 21 '10 at 07:54
  • 36
    @JohnMachin `ord(c) < 128` is infinitely more readable and intuitive than `c <= "\x7F"` – Slater Victoroff Jun 28 '13 at 16:04
  • 7
    The "inefficiency" depends on the length of the strings and the likelihood of ASCII data; for short strings that are not ASCII, this function may well be faster than setting up try/except block and handling the exception. – Alex Dupuy Jul 02 '13 at 19:27
  • 9
    Using a `try-catch` block here is not pythonic; it is abusing exceptions. – Fermat's Little Student Dec 12 '17 at 05:09
  • 4
    It seems people care more about being pythonic than being sane. This answer is much easier for me than the try/except because I want to use it inside a list comprehension. I want to filter a large text corpus, throwing out every word that is not ascii. How would you do that with try/except? – Binu Jasim May 30 '18 at 10:00
  • 1
    @BinuJasim You can still filter with the try/catch method in a list comprehension if you wrap it in a function that returns True or False based on whether the exception was raised. Not arguing for either method though; efficiency aside, I think both methods are fine as long as each lives in an appropriately named function. – user193130 Oct 02 '19 at 20:43
  • Not sure about the efficiency of try except here. When you catch an error like that the whole stack gets torn down rather than simple iteration. Would have to actually try it on a data set and time the result to be sure either way. – Sam Redway Mar 23 '21 at 14:45
  • 1
    In my only experiment, using the `try-except` method with `encode` on about 4Gbytes of text was about 30% faster than using a list comprehension in Python 3.7.6. – Hbar Jun 29 '21 at 17:17
  • try/except is not pythonic, it's imperative style. – Manuel Castejón Limas Oct 28 '21 at 06:20
    wow, I never thought I would find the best answer – Ice Bear Mar 22 '23 at 05:38
186

In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True
wjandrea
far
  • 9
    This is a nice little trick for detecting non-ascii characters in Unicode strings, which in python3 is pretty much all strings. Since ASCII characters can be encoded using only 1 byte, any ASCII string's length stays the same after it is encoded to bytes; whereas non-ASCII characters are encoded to 2 or 3 bytes each, which increases the length. – Devy Mar 04 '16 at 17:13
    By @far the best answer, but note that some chars like … and — may look like ascii, so if you want to use this to detect English text, make sure you replace such chars before checking – Christophe Roussy Jul 28 '16 at 12:25
  • 2
    But in Python2 it'll throw an UnicodeEncodeError. Got to find a solution for both Py2 and Py3 – alvas Jan 25 '17 at 06:43
  • 13
    This is just plain wasteful. It encodes a string in UTF-8, creating a whole other bytestring. True Python 3 way is `try: s.encode('ascii'); return True` `except UnicodeEncodeError: return False` (Like above, but encoding, as strings are Unicode in Python 3). This answer also raises an error in Python 3 when you have surrogates (e.g. `isascii('\uD800')` raises an error instead of returning `False`) – Artyer Jul 26 '17 at 13:36
  • This looks quite beautiful, but I wonder if it's as efficient as `all` when handling a long string – Endle_Zhenbo Apr 19 '18 at 21:02
  • @ChristopheRoussy - Those characters (code points) would be encoded to several bytes, so what you are saying makes no sense. – Shuzheng Feb 23 '21 at 12:58
  • I did a quick benchmark of @Artyer suggestion in comparison to the suggested answer in this post and found it to be slightly faster. – Wayne Workman Apr 01 '23 at 16:10
166

New in Python 3.7 (bpo32677)

No more tiresome/inefficient ascii checks on strings: the new built-in str/bytes/bytearray method .isascii() checks whether the string is ASCII.

print("is this ascii?".isascii())
# True
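
The method also exists on the byte types; a quick sketch (Python 3.7+):

```python
# str, bytes, and bytearray all gained .isascii() in Python 3.7
print("is this ascii?".isascii())     # True
print("é".isascii())                  # False
print(b"raw bytes".isascii())         # True
print(bytearray(b"\x80").isascii())   # False: 0x80 is outside 0-127
```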
Taku
  • 8
    `"\x03".isascii()` is also True. The documentation says this just checks that all characters are below code point 128 (0-127). If you also want to avoid control characters, you will need: `text.isascii() and text.isprintable()`. Just using `isprintable` by itself is also not enough, as it will consider a character like ¿ to be (correctly) printable, but it's not within the ascii printable section, so you need to check both if you want both. Yet another gotcha: spaces are considered printable, tabs and newlines are not. – Luc Apr 15 '20 at 15:47
  • 3
    @Luc Good to know, but ASCII includes control chars. Avoiding them is another topic. – wjandrea Jul 13 '20 at 17:08
  • @wjandrea Yeah, obviously, but because 0x03 fits in 7 bits doesn't mean that it's what most people will want to be checking for when they find this page in their search results. – Luc Jul 13 '20 at 18:24
  • 2
    @Luc Yes, exactly. If someone thinks that all ASCII characters are safe to print, they're mistaken, but that's a valid topic and could deserve its own question. – wjandrea Jul 13 '20 at 20:43
  • It's unfortunate that there isn't some way to make this answer jump to the top other than wait for upvotes. If the OP would log on again, they could at least accept it, but it seems they haven't been seen at all since posting this question. – John Y Jun 06 '22 at 15:45
  • I bench-marked this against @far answer and against suggestions on that answer, this is ever so slightly faster. – Wayne Workman Apr 01 '23 at 16:16
29

Vincent Marchetti has the right idea, but str.decode has been deprecated in Python 3. In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
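
Wrapped as a reusable predicate (a sketch assuming Python 3 `str` input):

```python
def is_ascii(s):
    """True if every character of the str s is in the ASCII range."""
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True
```

For example, `is_ascii('café')` returns False, while `is_ascii('cafe')` returns True.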

drs
  • OP's input is a bytestring (`bytes` type in Python 3 that has no `.encode()` method). [`.decode()` in @Vincent Marchetti's answer is correct](http://stackoverflow.com/a/196391/4279). – jfs Sep 04 '15 at 10:43
  • 1
    @J.F.Sebastian The OP asks "How to check if a string in Python is in ASCII?" and does not specify bytes vs unicode strings. Why do you say his/her input is a bytestring? – drs Sep 04 '15 at 13:07
  • 1
    look at the date of the question: `'é'` was a bytestring at the time. – jfs Sep 04 '15 at 13:19
  • 3
    @J.F.Sebastian, ok, well considering this answer answers this question as if it were asked today, I think it's still valid and helpful. Fewer and fewer people will come here looking for answers as if they were running Python in 2008 – drs Sep 04 '15 at 17:22
  • OP would not ask that question today: `ord('é') == 1` on Python 3 ([it can be >1 in general](http://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii/32357552?noredirect=1#comment292767_200267) but it is not related to the string being in the ascii range or not). Anyway, if it is not specified whether the input string is Unicode; both could be used: [`s.decode('ascii') if isinstance(s, bytes) else s.encode('ascii')`](http://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii/32357552#comment52660342_196391) – jfs Sep 04 '15 at 18:48
  • 2
    I found this question when i was searching for a solution for python3 and quickly reading the question didn't make me suspect that this was python 2 specfic. But this answer was really helpful - upvoting! – josch Oct 11 '15 at 18:51
  • I like this answer much better since I am using Python3, which all strings by definition are unicode strings as a result will have `.encode()` method to call on. This answer feels more pythonic than the other answer of comparing the character length pre and post encoding. – Devy Mar 04 '16 at 17:17
  • This is the py2- and py3-compatible solution I was looking for! – ebeezer Aug 15 '19 at 00:24
  • @jfs >>> ord('é') 233 – Kelvin Wang May 21 '22 at 04:10
  • @KelvinWang: I meant `len('é') == 1` on Python 3 (follow the link for `>1` case) -- a single Unicode code point (not 2 bytes like on Python 2 for OP). – jfs May 21 '22 at 06:20
18

Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0 to 1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 769].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented as either a single or multiple unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent" (i.e., code point 233), or "e" (code point 101) followed by "an acute accent on the previous character" (code point 769). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
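
The composed/decomposed distinction can be checked directly (shown here in Python 3 syntax, where string literals are Unicode by default):

```python
import unicodedata

composed = '\u00e9'     # 'é' as a single code point
decomposed = 'e\u0301'  # 'e' followed by a combining acute accent

print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False: different code point sequences
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True
```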

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.

Glyph
  • 3
    'é' does *not* necessarily represent a single code point. It could be *two* code points (U+0065 + U+0301). – jfs Jan 24 '09 at 03:17
  • 2
    Each abstract character is *always* represented by a single code point. However, code points may be encoded to multiple bytes, depending on the encoding scheme. i.e., 'é' is two bytes in UTF-8 and UTF-16, and four bytes in UTF-32, but it is in each case still a single code point — U+00E9. – Ben Blank Jan 24 '09 at 03:43
  • 5
    @Ben Blank: U+0065 and U+0301 *are* code points and they *do* represent 'é' which can *also* be represented by U+00E9. Google "combining acute accent". – jfs Jan 24 '09 at 07:48
  • J.F. is right about combining U+0065 and U+0301 to form 'é', but this is not a reversible function. You will get U+00E9. According to [wikipedia](http://en.wikipedia.org/wiki/Unicode#Ready-made_versus_composite_characters), these composite code points are useful for backwards compatibility – Martin Konecny Jul 07 '11 at 16:07
  • 1
    @teehoo - It is a reversible function in the sense that you may re-normalize the code point representing the composed character into a sequence of code points representing the same composed character. In Python you can do this like so: unicodedata.normalize('NFD', u'\xe9'). – Glyph Jul 08 '11 at 01:52
  • I've updated the answer to attempt to address some of this feedback as well as changes that have been made to the question. – Glyph Jul 08 '11 at 02:12
18

Ran into something like this recently - for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')
Alvin
9

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string; I didn't realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you're getting a rude and persistent

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode; for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for a better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
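
For anyone adapting this to Python 3, where literals are Unicode and encode goes straight to bytes, the same escaping idea can be sketched as:

```python
specials = 'àéç'
escaped = specials.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(escaped)  # &#224;&#233;&#231;
```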
Max P Magee
9

How about doing this?

import string

def isAscii(s):
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True
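
As the comment below points out, `string.ascii_letters` contains only letters, so spaces, digits, and punctuation would fail this check; a sketch of a variant that tests the full ASCII range instead:

```python
def is_ascii(s):
    # compare against the full ASCII range 0-127, not just letters
    return all(ord(c) < 128 for c in s)

print(is_ascii('hello, world! 123'))  # True: punctuation and digits pass
print(is_ascii('é'))                  # False
```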
miya
  • 8
    This fails if you string contains ASCII characters which are not letters. For you code examples, that includes newline, space, dot, comma, underscore, and parentheses. – florisla Jul 13 '17 at 09:52
4

Improving on Alexander's solution: from Python 2.6 (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)
  • 3
    it works but beware [there are known issues with character classification functions from `curses.ascii`](http://bugs.python.org/issue9770) – jfs Oct 02 '15 at 08:44
2

You could use a regular expression library that accepts the POSIX standard [[:ASCII:]] definition.

Steve Moyer
  • The `re` module in the Python standard library does not support POSIX character classes. – Flux Nov 23 '21 at 01:13
2

A string (str type) in Python is a series of bytes. There is no way of telling, just from looking at the string, whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
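
A sketch of that two-step approach, assuming the bytes are known to be UTF-8 (the encoding name here is an assumption for the example):

```python
import re

def is_ascii_bytes(raw, encoding='utf-8'):
    # decode using the known encoding, then look for any
    # character outside the ASCII range
    text = raw.decode(encoding)
    return re.search('[^\x00-\x7F]', text) is None

print(is_ascii_bytes(b'hello'))             # True
print(is_ascii_bytes('é'.encode('utf-8')))  # False
```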

JacquesB
1

Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.

hobs
0
import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.
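
On Python 3.4+, re.fullmatch makes the end anchor implicit; a minor variant of the same idea:

```python
import re

def is_ascii(s):
    # fullmatch requires the pattern to consume the entire string,
    # so no trailing $ anchor is needed
    return re.fullmatch(r'[\x00-\x7F]*', s) is not None

print(is_ascii('abc'))   # True
print(is_ascii(''))      # True: * also accepts the empty string
print(is_ascii('abcé'))  # False
```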

Roger Dahl
-2

To prevent your code from crashing, you may want to use a try/except to catch TypeError.

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False
  • 1
    This `try` wrapper is completely pointless. If `"¶"` is a Unicode string, then `ord("¶")` will work, and if it’s not (Python 2), `for c in s` will decompose it into bytes so `ord` will continue to work. – Ry- Oct 19 '19 at 07:07
-5

I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>> 

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False
  • 5
    -1 AARRGGHH this is treating all characters with ord(c) in range(128, 256) as ASCII!!! – John Machin Jul 21 '10 at 07:56
  • Doesn't work. Try calling the following: `is_ascii(u'i am ascii')`. Even though the letters and spaces are definitely ASCII, this still returns `False` because we forced the string to be `unicode`. – jpmc26 Feb 19 '14 at 19:30