27

Does anyone know if there is any built-in Python method that will check whether something is a valid Python variable name, including a check against reserved keywords? (So, for example, something like 'in' or 'for' would fail.)

Failing that, does anyone know where I can get a list of reserved keywords (i.e., dynamically, from within Python, as opposed to copying and pasting something from the online docs)? Or is there another good way of writing your own check?

Surprisingly, testing by wrapping a setattr in try/except doesn't work, as something like this:

setattr(myObj, 'My Sweet Name!', 23)

...actually works! (...and can even be retrieved with getattr!)

Josh Crozier
Paul Molodowitch

5 Answers

60

Python 3

Python 3 now has 'foo'.isidentifier(), so that seems to be the best solution for recent Python versions (thanks to runciter@freenode for the suggestion). However, somewhat counter-intuitively, it does not check against the list of keywords, so a combination of both must be used:

import keyword

def isidentifier(ident: str) -> bool:
    """Determines if string is valid Python identifier."""

    if not isinstance(ident, str):
        raise TypeError("expected str, but got {!r}".format(type(ident)))

    if not ident.isidentifier():
        return False

    if keyword.iskeyword(ident):
        return False

    return True
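
As suggested in the comments, the two checks can be folded into a single expression. This is only a condensed sketch of the function above: the name is_valid_name is made up for illustration, and the explicit type check is left out, so a non-string argument fails with AttributeError instead of TypeError.

```python
import keyword


def is_valid_name(ident):
    # Hypothetical helper name. Syntactically an identifier AND not reserved.
    return ident.isidentifier() and not keyword.iskeyword(ident)


for candidate in ("foo", "for", "My Sweet Name!", "True"):
    print(candidate, is_valid_name(candidate))
```

On Python 3, True, False and None are reported by keyword.iskeyword, so they are rejected along with for and in.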

Python 2

For Python 2, the easiest way to check whether a given string is a valid Python identifier is to let Python parse it itself.

There are two possible approaches. The fastest is to use ast and check that the AST of a single expression has the desired shape:

import ast

def isidentifier(ident):
    """Determines if string is valid Python identifier."""

    # Smoke test — if it's not string, then it's not identifier, but we don't
    # want to just silence exception. It's better to fail fast.
    if not isinstance(ident, str):
        raise TypeError("expected str, but got {!r}".format(type(ident)))

    # Resulting AST of simple identifier is <Module [<Expr <Name "foo">>]>
    try:
        root = ast.parse(ident)
    except SyntaxError:
        return False

    if not isinstance(root, ast.Module):
        return False

    if len(root.body) != 1:
        return False

    if not isinstance(root.body[0], ast.Expr):
        return False

    if not isinstance(root.body[0].value, ast.Name):
        return False

    if root.body[0].value.id != ident:
        return False

    return True

Another is to let the tokenize module split the identifier into a stream of tokens and check that it contains only our name:

import keyword
import tokenize

def isidentifier(ident):
    """Determines if string is valid Python identifier."""

    # Smoke test - if it's not string, then it's not identifier, but we don't
    # want to just silence exception. It's better to fail fast.
    if not isinstance(ident, str):
        raise TypeError("expected str, but got {!r}".format(type(ident)))

    # Quick test - if string is in keyword list, it's definitely not an ident.
    if keyword.iskeyword(ident):
        return False

    readline = lambda g=(lambda: (yield ident))(): next(g)
    tokens = list(tokenize.generate_tokens(readline))

    # You should get exactly 2 tokens
    if len(tokens) != 2:
        return False

    # First is NAME, identifier.
    if tokens[0][0] != tokenize.NAME:
        return False

    # Name should span all the string, so there would be no whitespace.
    if ident != tokens[0][1]:
        return False

    # Second is ENDMARKER, ending stream
    if tokens[1][0] != tokenize.ENDMARKER:
        return False

    return True

The same function, but compatible with Python 3, looks like this:

import keyword
import tokenize

def isidentifier_py3(ident):
    """Determines if string is valid Python identifier."""

    # Smoke test — if it's not string, then it's not identifier, but we don't
    # want to just silence exception. It's better to fail fast.
    if not isinstance(ident, str):
        raise TypeError("expected str, but got {!r}".format(type(ident)))

    # Quick test — if string is in keyword list, it's definitely not an ident.
    if keyword.iskeyword(ident):
        return False

    readline = lambda g=(lambda: (yield ident.encode('utf-8-sig')))(): next(g)
    tokens = list(tokenize.tokenize(readline))

    # You should get exactly 3 tokens
    if len(tokens) != 3:
        return False

    # If using Python 3, first one is ENCODING, it's always utf-8 because 
    # we explicitly passed in UTF-8 BOM with ident.
    if tokens[0].type != tokenize.ENCODING:
        return False

    # Second is NAME, identifier.
    if tokens[1].type != tokenize.NAME:
        return False

    # Name should span all the string, so there would be no whitespace.
    if ident != tokens[1].string:
        return False

    # Third is ENDMARKER, ending stream
    if tokens[2].type != tokenize.ENDMARKER:
        return False

    return True

However, be aware of bugs in the Python 3 tokenize implementation that reject some completely valid identifiers, like ℘᧚ and 贈ᩭ. ast works fine though. Generally, I'd advise against using the tokenize-based implementation for actual checks.
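
A quick way to see that the parser-based routes accept such names on Python 3 is the sketch below. It spells the first example with escapes (U+2118 followed by U+19DA) so the snippet does not depend on the file encoding; accepted_by_ast is a name made up for illustration, not part of any library.

```python
import ast


def accepted_by_ast(name):
    # Hypothetical helper: True if the whole string parses as one bare name.
    try:
        tree = ast.parse(name)
    except SyntaxError:
        return False
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.Expr):
        return False
    return isinstance(tree.body[0].value, ast.Name)


sample = "\u2118\u19da"  # Other_ID_Start followed by Other_ID_Continue
print(sample.isidentifier(), accepted_by_ast(sample))
```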

Also, some may consider heavy machinery like an AST parser to be a tad overkill. This simple implementation is self-contained and guaranteed to work on any Python 2:

import keyword
import string

def isidentifier(ident):
    """Determines if string is valid Python identifier."""

    if not isinstance(ident, str):
        raise TypeError("expected str, but got {!r}".format(type(ident)))

    if not ident:
        return False

    if keyword.iskeyword(ident):
        return False

    # ascii_letters is locale-independent, unlike string.lowercase/uppercase
    first = '_' + string.ascii_letters
    if ident[0] not in first:
        return False

    other = first + string.digits
    for ch in ident[1:]:
        if ch not in other:
            return False

    return True

Here are a few tests to check that these all work:

assert(isidentifier('foo'))
assert(isidentifier('foo1_23'))
assert(not isidentifier('pass'))    # syntactically correct keyword
assert(not isidentifier('foo '))    # trailing whitespace
assert(not isidentifier(' foo'))    # leading whitespace
assert(not isidentifier('1234'))    # number
assert(not isidentifier('1234abc')) # number and letters
assert(not isidentifier(''))      # Unicode not from allowed range
assert(not isidentifier(''))        # empty string
assert(not isidentifier('   '))     # whitespace only
assert(not isidentifier('foo bar')) # several tokens
assert(not isidentifier('no-dashed-names-for-you')) # no such thing in Python

# Unicode identifiers are only allowed in Python 3:
assert(isidentifier('℘᧚')) # Unicode $Other_ID_Start and $Other_ID_Continue

Performance

All measurements were conducted on my machine (MBPr Mid 2014) on the same randomly generated test set of 1 500 000 elements: 1 000 000 valid and 500 000 invalid. YMMV.

== Python 3:
method | calls/sec | faster
---------------------------
token  |    48 286 |  1.00x
ast    |   175 530 |  3.64x
native | 1 924 680 | 39.86x

== Python 2:
method | calls/sec | faster
---------------------------
token  |    83 994 |  1.00x
ast    |   208 206 |  2.48x
simple | 1 066 461 | 12.70x
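
For anyone who wants to repeat this kind of measurement, here is a rough sketch of how calls/sec can be obtained with timeit on Python 3. It is not the harness used for the numbers above: the sample list is a tiny hand-written stand-in for the random test set, and native and calls_per_second are names chosen for illustration.

```python
import keyword
import timeit


def native(ident):
    # The "native" Python 3 approach: str.isidentifier plus keyword check.
    return ident.isidentifier() and not keyword.iskeyword(ident)


def calls_per_second(func, names, repeat=3):
    # Time one full pass over the sample several times and keep the best run.
    best = min(timeit.repeat(lambda: [func(n) for n in names],
                             number=1, repeat=repeat))
    return len(names) / best


# Stand-in data only; a real comparison needs a much larger, random set.
sample = ["foo", "foo1_23", "pass", "foo ", "1234", "no-dashes"] * 1000
rate = calls_per_second(native, sample)
print("native: {:,.0f} calls/sec".format(rate))
```

Passing each of the other implementations to calls_per_second with the same sample gives a like-for-like comparison on your own machine.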
toriningen
  • @RobL, it seems that it's a bug in `.isidentifier()` implementation, because `True`, as well as `None` and `False`, are not valid identifier names. Could you please file a bug report? – toriningen Jul 06 '15 at 08:51
  • @RobL, I have updated Python3 version answer. Better 2 year late than never, right? :) – toriningen Aug 10 '17 at 08:21
  • 2
    Not a bug. Those are valid identifiers according to the language definition. You have to use `keyword.iskeyword` to test for *reserved* identifiers such as def, class, True, None, False. – wim Nov 17 '17 at 23:18
  • I would suggest replacing those `if` clauses and the final `return` in the Py3 version with a single expression `return ident.isidentifier() and not keyword.iskeyword(ident)` – Eli Korvigo Feb 08 '18 at 22:13
  • 5
    I'm not sure why my answer is accepted instead of this one. My answer provides useful information, but this one actually answers the question. – asmeurer May 14 '18 at 06:20
  • I like your `ast` version the most. It also works fine in Python 3 (so it's relatively portable), **plus** it automatically disallows identifiers which happen to be keywords (assuming that's what you want, which you probably do). The latter feature is a somewhat subtle one.. – martineau Jun 19 '18 at 18:09
  • @martineau, if you're aiming Python 3 only, I'd recommend using native approach. I've updated my answer with performance measurements to help you decide. – toriningen Jun 19 '18 at 22:10
  • toriningen: Thanks for the performance information update, but it's not really an apples-to-apples comparison because determining whether the identifier is a keyword requires an extra step with some of them. Regardless, in my use case, the code is being designed to run with _both_ version 2.7+ and 3.x of Python. Fortunately the part of it that needs this functionality isn't in a performance-critical section, so your `ast` approach seems like the overall best choice for it. – martineau Jun 20 '18 at 00:00
14

The keyword module contains the list of all reserved keywords:

>>> import keyword
>>> keyword.iskeyword("in")
True
>>> keyword.kwlist
['and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'exec', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'not', 'or', 'pass', 'print', 'raise', 'return', 'try', 'while', 'with', 'yield']

Note that this list will be different depending on what major version of Python you are using, as the list of keywords changes (especially between Python 2 and Python 3).
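
Because the list depends on the interpreter, it is safest to read it at run time instead of hard-coding it. A small sketch: keyword.kwlist is available everywhere, while keyword.softkwlist only exists on newer Python 3 releases, hence the getattr fallback. Soft keywords are still valid identifiers, so they are shown here only for information.

```python
import keyword
import sys

# Reserved words of whichever interpreter is running this code.
reserved = frozenset(keyword.kwlist)

# Context-dependent keywords; the attribute is missing on older versions.
soft = frozenset(getattr(keyword, "softkwlist", ()))

print(sys.version_info[:2], len(reserved), sorted(soft))
```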

If you also want all builtin names, use __builtins__:

>>> dir(__builtins__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning', 'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError', 'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FileExistsError', 'FileNotFoundError', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'InterruptedError', 'IsADirectoryError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError', 'None', 'NotADirectoryError', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'PermissionError', 'ProcessLookupError', 'ReferenceError', 'ResourceWarning', 'RuntimeError', 'RuntimeWarning', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'TimeoutError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '_', '__build_class__', '__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs', 'all', 'any', 'ascii', 'bin', 'bool', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'exec', 'exit', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'vars', 'zip']

And note that some of these (like copyright) are not really that big of a deal to override.
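
If the check lives in a module rather than an interactive session, importing the builtins module is more dependable than the __builtins__ global, which is a CPython detail and may be a plain dict outside of __main__. A minimal Python 3 sketch (Python 2 calls the module __builtin__); shadows_builtin is a hypothetical helper name:

```python
import builtins


def shadows_builtin(name):
    # True if assigning to this name would hide a builtin such as list or str.
    return hasattr(builtins, name)


print(shadows_builtin("list"), shadows_builtin("my_list"))
```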

One more caveat: note that in Python 2, True, False, and None are not considered keywords. However, assigning to None is a SyntaxError. Assigning to True or False is allowed, though not recommended (same with any other builtin). In Python 3, they are keywords, so this is not an issue.

asmeurer
  • The above list doesn't include the names of built-in objects/types, so it wouldn't catch other common mistakes (like naming a text variable "str" or a list "list"). I'm not sure how to retrieve a list of these programmatically, aside from using help(__builtins__) in an interactive python command line. – abought Oct 03 '12 at 02:03
  • 1
    I added how to get all builtin names. – asmeurer Oct 03 '12 at 02:07
  • You *can* use it as a variable name, but it's not generally a good idea to shadow built-in functions or variable types; it can interfere with legitimate uses. ( http://wiki.python.org/moin/BeginnerErrorsWithPythonProgramming ) – abought Oct 03 '12 at 02:08
  • but you question didnt mention shadowing builtins ... but its true just check for membership in the dir above... – Joran Beasley Oct 03 '12 at 02:09
  • 2
    I also added one caveat about `None` in Python 2. It is not considered a keyword, but assigning to it is a SyntaxError. – asmeurer Oct 03 '12 at 02:11
  • then use one of my second two solutions and check for membership in buitins also ... or a small list of "extra" words – Joran Beasley Oct 03 '12 at 02:14
  • this, combined with a regexp for valid python names, did the trick... thanks! – Paul Molodowitch Dec 12 '12 at 23:25
  • 1
    Another small caveat. Apparently keywords that are only available through `__future__` imports are always in this list (e.g., `with` in Python 2.5). – asmeurer Dec 16 '12 at 03:33
13

John: as a slight improvement, I added a $ in the re, otherwise, the test does not detect spaces:

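import keyword
import re

my_var = "$testBadVar"
# bool() makes the result True/False instead of a match object or None;
# the parenthesised print works on both Python 2 and Python 3.
print(bool(re.match(r"[_A-Za-z][_a-zA-Z0-9]*$", my_var)) and not keyword.iskeyword(my_var))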
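
One more wrinkle with $: in Python's re it also matches just before a trailing newline, so "foo\n" still passes. Anchoring with \Z (or using re.fullmatch on Python 3) closes that gap. The sketch below covers ASCII-only names; is_ascii_identifier is a name invented for illustration.

```python
import keyword
import re

# \Z matches only at the very end of the string, unlike $.
_ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def is_ascii_identifier(name):
    return bool(_ASCII_NAME.match(name)) and not keyword.iskeyword(name)


# The $-anchored pattern lets a trailing newline through:
print(bool(re.match(r"[_A-Za-z][_a-zA-Z0-9]*$", "foo\n")))
print(is_ascii_identifier("foo\n"))
```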
Roeland Huys
0

The list of Python keywords is short, so you can just check the syntax with a simple regex and test membership in a relatively small list of keywords:

import keyword  # thanks asmeurer
import re

my_var = "$testBadVar"
print(bool(re.match("[_A-Za-z][_a-zA-Z0-9]*", my_var)) and not keyword.iskeyword(my_var))

A shorter but more dangerous alternative would be:

my_bad_var = "%#ASD"
try:
    exec("{0}=1".format(my_bad_var))
except SyntaxError:  # this may not be the right error
    print("Invalid variable name!")

And lastly, a slightly safer variant:

my_bad_var = "%#ASD"

try:
    cc = compile("{0}=1".format(my_bad_var), "asd", "single")
    eval(cc)
    print("VALID")
except SyntaxError:  # maybe a different error
    print("INVALID!")
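
For what it's worth, compile alone already raises SyntaxError for a bad name, so the generated assignment never has to be executed. A sketch of that idea follows; compiles_as_assignment is a made-up name. One caveat: anything that is a legal assignment target compiles, so a string like a.b passes even though it is not a single identifier.

```python
def compiles_as_assignment(name):
    # Only compile the "<name>=1" statement; never run it.
    try:
        compile("{0}=1".format(name), "<name-check>", "single")
    except SyntaxError:
        return False
    return True


for candidate in ("foo", "%#ASD", "for", "a.b"):
    print(candidate, compiles_as_assignment(candidate))
```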
Joran Beasley
  • 2
    Don't use a predetermined set of keywords. That will not be portable between Python versions, where the keyword list changes. – asmeurer Oct 03 '12 at 02:00
  • 1
    there are alternatives that dont ... but your method is better :P (now fixed) – Joran Beasley Oct 03 '12 at 02:02
  • 1
    Shouldn't the regular expression include `_` in the second part as well? Also, you have `0-0` instead of `0-9`. – asmeurer Oct 03 '12 at 02:12
  • thank you on both counts :L .... fixed (on a side note i think if you asked *the google* it would find an exact regex from python specs for you or at least a definition) – Joran Beasley Oct 03 '12 at 02:13
  • 1
    The regex should include `$` at the end. Otherwise it will match as long as it starts with a valid identifier. – rdb Jun 11 '15 at 11:07
0

I needed to check for Python 3 identifiers from Python 2 code. I used a regex based on the docs:

import keyword
import unicodedata

import regex  # third-party package, not the built-in re


def is_py3_identifier(ident):
    """Checks that ident is a valid Python 3 identifier according to
    https://docs.python.org/3/reference/lexical_analysis.html#identifiers
    """
    return bool(
        ID_REGEX.match(unicodedata.normalize('NFKC', ident)) and
        ident not in PY3_KEYWORDS)

# See https://docs.python.org/3/reference/lexical_analysis.html#identifiers
ID_START_REGEX = (
    r'\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}'
    r'_\u1885-\u1886\u2118\u212E\u309B-\u309C')
ID_CONTINUE_REGEX = ID_START_REGEX + (
    r'\p{Mn}\p{Mc}\p{Nd}\p{Pc}'
    r'\u00B7\u0387\u1369-\u1371\u19DA')
ID_REGEX = regex.compile(
    "[%s][%s]*$" % (ID_START_REGEX, ID_CONTINUE_REGEX), regex.UNICODE)


# True, False and None are not in Python 2's keyword.kwlist, so add them.
PY3_KEYWORDS = frozenset(['False', 'None', 'True']).union(keyword.kwlist)

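Note: this uses the regex package, not the built-in re package, for matching against Unicode categories. Also: because keyword.kwlist comes from the interpreter running the check, under Python 2 this will accept nonlocal (a keyword in Python 3 but not in Python 2) and reject print and exec (keywords in Python 2 only).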

Will Manley