In Python, how to check if a string only contains certain characters?

Question

I need to check that a string contains only a..z, 0..9, and . (period), and no other characters.

I could iterate over each character and check that it is a..z, 0..9, or ., but that would be slow.

I am not sure how to do it with a regular expression.

Is this correct? Can you suggest a simpler regular expression or a more efficient approach?

#Valid chars . a-z 0-9
def check(test_str):
    import re
    #http://docs.python.org/library/re.html
    #re.search returns None if no position in the string matches the pattern
    #pattern to search for any character other than . a-z 0-9
    pattern = r'[^\.a-z0-9]'
    if re.search(pattern, test_str):
        #Character other than . a-z 0-9 was found
        print('Invalid : %r' % (test_str,))
    else:
        #No character other than . a-z 0-9 was found
        print('Valid   : %r' % (test_str,))

check(test_str='abcde.1')
check(test_str='abcde.1#')
check(test_str='ABCDE.12')
check(test_str='_-/>"!@#12345abcde<')

'''
Output:
>>> 
Valid   : "abcde.1"
Invalid : "abcde.1#"
Invalid : "ABCDE.12"
Invalid : "_-/>"!@#12345abcde<"
'''

Looks fine to me. You don't need the backslash before the . if you're in a character class, but that's only a one character saving ;) — Alice Purcell, Aug 24 '09 at 16:26
@Ingenutrix, John indeed found a bug in my answer. I think his solution is the best. — Nadia Alramli, Aug 25 '09 at 13:00
See also Tim Peters' answer to this question: [How to check if a string contains only characters from a given set in python](https://stackoverflow.com/q/20726010/2102457) — mattst, Dec 11 '19 at 18:17
If you want to *transform* the string to contain only the specified characters, see https://stackoverflow.com/questions/15754587. In some special cases, additional techniques apply: e.g. https://stackoverflow.com/questions/1450897 for digits only. Also see https://stackoverflow.com/questions/295135 specifically for creating valid file names. — Karl Knechtel, Aug 01 '22 at 20:21

John Millikin · Answer 1 · 2009-08-25T15:15:02.313

95

Here's a simple, pure-Python implementation. It should be used when performance is not critical (included for future Googlers).

import string
allowed = set(string.ascii_lowercase + string.digits + '.')

def check(test_str):
    return set(test_str) <= allowed

Regarding performance, iteration will probably be the fastest method. Regexes have to iterate through a state machine, and the set-based solution has to build a temporary set. However, the difference is unlikely to matter much. If performance of this function is very important, write it as a C extension module with a switch statement (which will be compiled to a jump table).
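As a rough sketch of how the three pure-Python approaches mentioned here (plain iteration, a set comparison, and a regex) could be compared on your own machine: the function names `check_loop`, `check_set` and `check_regex` are invented for this illustration, and the timings will vary with the Python version and the input.

```python
import re
import string
import timeit

ALLOWED = frozenset(string.ascii_lowercase + string.digits + ".")
INVALID = re.compile(r"[^a-z0-9.]")


def check_loop(text):
    # Plain iteration; stops at the first disallowed character.
    for ch in text:
        if ch not in ALLOWED:
            return False
    return True


def check_set(text):
    # Builds a temporary set of the characters, then tests for a subset.
    return set(text) <= ALLOWED


def check_regex(text):
    # Searches for any single character outside the allowed class.
    return INVALID.search(text) is None


if __name__ == "__main__":
    sample = "jsdlfjdsf12324..3432jsdflsdf"
    for func in (check_loop, check_set, check_regex):
        seconds = timeit.timeit(lambda: func(sample), number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s")
```

All three functions treat the empty string as valid, since it contains no disallowed characters.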

Here's a C implementation, which uses if statements due to space constraints. If you absolutely need the tiny bit of extra speed, write out the switch-case. In my tests, it performs very well (2 seconds vs 9 seconds in benchmarks against the regex).

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *check(PyObject *self, PyObject *args)
{
        const char *s;
        Py_ssize_t count, ii;
        char c;
        if (0 == PyArg_ParseTuple (args, "s#", &s, &count)) {
                return NULL;
        }
        for (ii = 0; ii < count; ii++) {
                c = s[ii];
                if ((c < '0' && c != '.') || c > 'z') {
                        Py_RETURN_FALSE;
                }
                if (c > '9' && c < 'a') {
                        Py_RETURN_FALSE;
                }
        }

        Py_RETURN_TRUE;
}

PyDoc_STRVAR (DOC, "Fast stringcheck");
static PyMethodDef PROCEDURES[] = {
        {"check", (PyCFunction) (check), METH_VARARGS, NULL},
        {NULL, NULL}
};
PyMODINIT_FUNC
initstringcheck (void) {
        Py_InitModule3 ("stringcheck", PROCEDURES, DOC);
}

Include it in your setup.py:

from distutils.core import setup, Extension

setup(
    name='stringcheck',
    ext_modules=[
        Extension('stringcheck', ['stringcheck.c']),
    ],
)

Use as:

>>> from stringcheck import check
>>> check("abc")
True
>>> check("ABC")
False

edited Aug 25 '09 at 15:15

answered Aug 24 '09 at 16:24

John Millikin


Interesting! 1) Would set(test_str) get all the characters in the same order as 'allowed'? 2) I will have to check the speed of 'set(test_str) == allowed' as compared to re.search later, but is this faster? Any idea? – X10 Aug 24 '09 at 16:34
sets are unordered. Since they are implemented using `hash()`, checking for set membership is probably the fastest pure-Python solution. As mentioned, if you need better performance, use switch-case in C. – John Millikin Aug 24 '09 at 16:37
@Nadia: your solution is incorrect. If I wanted results which are fast and wrong, I would ask my cat. – John Millikin Aug 24 '09 at 17:09
@John, please you don't have to be rude about it. So far there are 2 solutions that are about 3 times faster than this one. – Nadia Alramli Aug 24 '09 at 17:22
@John, and by the way your solution is incorrect too. check('') => True – Nadia Alramli Aug 24 '09 at 17:22
I can't say that I like downvoting a solution as a reaction to "it's **slower** than my/another solution". If it's **wrong**, downvoting makes sense. But even in "code golf" questions, any answer that's not the smallest doesn't get downvoted, it just won't get as many upvotes over time. – Adam V Aug 24 '09 at 17:22
@Nadia: I don't understand. Are you saying check("") returns False in your tests? I think it ought to return True -- an empty string, obviously, does not contain any invalid characters. – John Millikin Aug 24 '09 at 17:25
@Adam, you are correct. I felt that I had to downvote it because unfortunately most users have the instinct to blindly upvote solutions just because they are on the top without reading others. Just look at Mark's solution which is obviously very slow – Nadia Alramli Aug 24 '09 at 17:28
Since performance is apparently such an issue, I've updated my answer to contain the fastest method I can think of (a C extension module). – John Millikin Aug 24 '09 at 17:45
@John, since you added a faster method. I'll take back the -1. – Nadia Alramli Aug 24 '09 at 17:50
You guys are just awesome. Especially Nadia's and your contributions have enriched this solution a lot. Going with C code is surely faster, but a lot of Python newbies would go with a Python solution. I had no idea different solutions in Python would be so diverse when it comes to speed. This has been an eye-opener. For now, I would tend to go with Nadia's regex answer. If you have a different take on it (in Python) or another perspective, your input would surely help. – X10 Aug 24 '09 at 17:56
@John Millikin, Thanks for your solution. Not only the solution, but also seeing the steps to final solution was enlightening. – X10 Aug 24 '09 at 18:24
@John Millikin: -1 Your solution doesn't check for '.' AND it fails if the input contains '\x00'. What was that about your cat? – John Machin Aug 25 '09 at 04:33
The solution will not fail in the presence of a '\x00' byte -- Python will throw an exception when it encounters a rogue '\x00'. I've modified it to return False in that case, since other commenters seem to agree with you, but I doubt that case will ever occur in real life unless a program accepts input from the user without verifying that it's valid text. – John Millikin Aug 25 '09 at 15:21
@John Millikin: The exception constitutes a failure. The purpose of the function is to verify that the text is valid!!! Users are unlikely to be the problem; more likely to result from some act of a careless programmer -- go looking in free-text "comment" or "note" columns in databases if you want an introduction to "real life". – John Machin Aug 25 '09 at 15:53
A failure would be if the function returned "true" for invalid text. An exception is unexpected, but does not allow execution to proceed along the code path for a correct string, and is thus not a failure. If data is pulled into the program from an external source, such as from a file or database, it is user input and should be checked before use. That includes checking that a string is valid UTF-8 (or whatever encoding is used for storage). – John Millikin Aug 25 '09 at 16:01
I like that solution, but I think you need a set difference here `set(test_str) - allowed` – Sep 04 '14 at 02:10
Beautiful solution with sets. – oblalex Apr 16 '20 at 19:12
Seems like it returns None because `return` is missing. `return set(test_str) <= allowed` works. – Reveille Apr 29 '20 at 15:21

John Machin · Accepted Answer · 2009-08-25T15:38:16.853

Final(?) edit

Answer, wrapped up in a function, with annotated interactive session:

>>> import re
>>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
...     return not bool(search(strg))
...
>>> special_match("")
True
>>> special_match("az09.")
True
>>> special_match("az09.\n")
False
# The above test case is to catch out any attempt to use re.match()
# with a `$` instead of `\Z` -- see point (6) below.
>>> special_match("az09.#")
False
>>> special_match("az09.X")
False
>>>

Note: There is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True; this is puzzling (perhaps it's the cost of returning a MatchObject instead of None) and may warrant further rummaging.

==== Earlier text ====

The [previously] accepted answer could use a few improvements:

(1) Presentation gives the appearance of being the result of an interactive Python session:

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True

but match() doesn't return True

(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^

(3) Should encourage the habit of always using a raw string for any re pattern

(4) The backslash in front of the dot/period is redundant

(5) Slower than the OP's code!

prompt>rem OP's version -- NOTE: OP used raw string!

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.43 usec per loop

prompt>rem OP's version w/o backslash

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.44 usec per loop

prompt>rem cleaned-up version of accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
100000 loops, best of 3: 2.07 usec per loop

prompt>rem accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
100000 loops, best of 3: 2.08 usec per loop

(6) Can produce the wrong answer!!

>>> import re
>>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
True # uh-oh
>>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
False
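To make point (6) concrete, here is a small sketch contrasting `$`, `\Z` and `re.fullmatch`. Note that `re.fullmatch` is not used anywhere in this thread; it was added in Python 3.4 and is shown only as an alternative way of anchoring at the true end of the string. The helper names are made up for the example.

```python
import re

DOLLAR = re.compile(r"[a-z0-9.]+$")        # `$` also matches just before a trailing "\n"
BACKSLASH_Z = re.compile(r"[a-z0-9.]+\Z")  # `\Z` matches only at the very end
ALLOWED = re.compile(r"[a-z0-9.]+")        # no end anchor; used with fullmatch()


def ok_with_dollar(text):
    return DOLLAR.match(text) is not None


def ok_with_backslash_z(text):
    return BACKSLASH_Z.match(text) is not None


def ok_with_fullmatch(text):
    # fullmatch() succeeds only if the entire string matches the pattern.
    return ALLOWED.fullmatch(text) is not None


print(ok_with_dollar("1234\n"))       # True: the trailing newline slips through
print(ok_with_backslash_z("1234\n"))  # False
print(ok_with_fullmatch("1234\n"))    # False
```

Because these patterns use `+`, all three helpers reject the empty string; switch to `*` if an empty string should count as valid.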

+1 Thanks for correcting my answer. I forgot that match checks for a match only at the beginning of the string. Ingenutrix, I think you should select this answer as accepted. — Nadia Alramli, Aug 25 '09 at 12:59
WOW. Getting another solution after accepting one. @John Machin, thanks for taking this up. Could you please just put the final cleaned up solution at top of your post. All these different (though great posts) will probably be confusing for another newbie who comes here searching for the final solution. Please do not change or remove anything in your post, it is great to see your explanation thru your steps. They are very informative. Thanks. — X10, Aug 25 '09 at 14:22
@Nadia: That was very gracious of you. Thanks! @Ingenutrix: Cleaned up as requested. — John Machin, Aug 25 '09 at 15:41

score 51 · Answer 3 · answered Aug 24 '09 at 16:26

Simpler approach? A little more Pythonic?

>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False

It certainly isn't the most efficient, but it's sure readable.

Mark Rushakoff


`ok = dict.fromkeys("0123456789abcdef")` might speed it up without hurting readability much. – jfs Aug 25 '09 at 22:21
@J.F.Sebastian: On my system the trick with dict.fromkeys and using a long and a short test-string it is only 1 to 3 % faster. (using python 3.3) – erik Jul 20 '15 at 23:11
@erik: use `bytes.translate` for speed. See [the discussion in the comments and the performance comparison in the answer](http://stackoverflow.com/questions/29998052/deleting-consonants-from-a-string-in-python/29998062#comment50841257_29998062) – jfs Jul 21 '15 at 02:57
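A minimal sketch of the `bytes.translate` idea from the comment above, written for Python 3 and for the question's character set (a-z, 0-9 and period) rather than the hex digits used in this answer. The function names are hypothetical, and the `str` wrapper assumes that any non-ASCII input should simply be rejected.

```python
ALLOWED_BYTES = b"abcdefghijklmnopqrstuvwxyz0123456789."


def only_allowed_bytes(data: bytes) -> bool:
    # translate(None, delete) removes every byte listed in `delete`;
    # whatever is left over must be a disallowed byte.
    return not data.translate(None, ALLOWED_BYTES)


def only_allowed_text(text: str) -> bool:
    # Assumption: non-ASCII text can never be valid, so an encoding
    # failure is treated as "contains a disallowed character".
    try:
        data = text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return only_allowed_bytes(data)
```

Whether this is actually faster than a compiled regex depends on the input sizes, so it is worth timing before adopting it.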

Nadia Alramli · Answer 4 · 2009-08-24T18:26:03.327

17

EDIT: Changed the regular expression to exclude A-Z

Regular expression solution is the fastest pure python solution so far

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902

Compared to other solutions:

>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447

If you want to allow empty strings then change it to:

reg=re.compile('^[a-z0-9\.]*$')
>>>bool(reg.match(''))
True

By request, I'm restoring the other part of the answer. Please note that the following also accepts the A-Z range.

You can use isalnum

test_str.replace('.', '').isalnum()

>>> 'test123.3'.replace('.', '').isalnum()
True
>>> 'test123-3'.replace('.', '').isalnum()
False

EDIT Using isalnum is much more efficient than the set solution

>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766

EDIT2 John gave an example where the above doesn't work. I changed the solution to overcome this special case by using encode

test_str.replace('.', '').encode('ascii', 'replace').isalnum()

And it is still almost 3 times faster than the set solution

timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318

In my opinion, using regular expressions is the best way to solve this problem
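For readers who still want the `isalnum` route, the sketch below shows one way the caveats raised in this thread (uppercase letters and non-ASCII alphanumerics being accepted) might be closed off. It is not part of the original answer: `check_isalnum` is a made-up name, `str.isascii` requires Python 3.7 or later, and treating the empty string as valid is an assumption.

```python
def check_isalnum(text: str) -> bool:
    stripped = text.replace(".", "")
    return (
        text.isascii()                # rejects non-ASCII letters and digits
        and text == text.lower()      # rejects A-Z
        # A string of only periods (or an empty one) leaves nothing to test.
        and (not stripped or stripped.isalnum())
    )


print(check_isalnum("test123.3"))  # True
print(check_isalnum("test123-3"))  # False
print(check_isalnum("ABC"))        # False
```

With this many extra conditions, the single compiled regex is arguably the easier version to read and to get right.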

edited Aug 24 '09 at 18:26

answered Aug 24 '09 at 16:28

Nadia Alramli


Looks like this doesn't work properly: u"ABC\u0131\u0661".replace('.','').isalnum() -> True, but should be False for the OP's test – John Millikin Aug 24 '09 at 17:08
Very interesting! Thanks for the speed details. By the way, the uppercase check should fail, but that is a minor issue: `'A.a'.lower().replace('.', '').isalnum()` returns True. Can you please update your non-encode, encode and regex solutions to exclude A-Z? (It's a minor issue, but you are so far ahead of me on this that I don't want to put .lower() in the wrong place and mess up the answer.) My primary concern was to be sure my solution is correct, but I am glad I posted the problem here, as speed is very important. This check will be done a few million times, and having seen the speed results, it does matter! – X10 Aug 24 '09 at 17:34
!! I think I was wrong about A.a'.lower().replace('.', '').isalnum()..this best left to you experts. – X10 Aug 24 '09 at 17:36
Nadia, your earlier detailed post was far more informative and educational, (even if it deviated a bit from the question). If you can restore it, please do. Just reading thru it helps newbies like me. – X10 Aug 24 '09 at 17:59
If you do decide to go with this approach, one other performance note is that you should probably compile the regexp once and then re-use the compiled version instead of compiling it every time you call the function. Compiling a regexp is a pretty time-consuming process. – Brent Writes Code Aug 24 '09 at 18:22
@Nadia, Thanks for taking the time and effort to provide your solution. I do feel you may want to have left your detailed post as is. As I watched you and John update your answers, it was more informative than just the final solution. Special thanks for providing the timing details. – X10 Aug 24 '09 at 18:27
@Ingenutrix, I returned the rest of the answer as requested. And as Brent said you need to compile the regular expression only once. – Nadia Alramli Aug 24 '09 at 18:27
@Nadia, can you please look into John Machin's answer. He mentions there is something amiss with your solution. – X10 Aug 25 '09 at 12:43

KingRadical · Answer 5 · 2017-01-30T21:06:26.493

This has already been answered satisfactorily, but for people coming across this after the fact, I have done some profiling of several different methods of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

Here are my test implementations:

import re

hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')

def test_set(input):
    return set(input) <= hex_digits

def test_not_any(input):
    return not any(c not in hex_digits for c in input)

def test_re_match1(input):
    return bool(re.compile(r'^[A-F0-9]+\Z').match(input))

def test_re_match2(input):
    return bool(hex_match.match(input))

def test_re_match3(input):
    return bool(re.match(r'^[A-F0-9]+\Z', input))

def test_re_search1(input):
    return not bool(re.compile(r'[^A-F0-9]').search(input))

def test_re_search2(input):
    return not bool(hex_search.search(input))

def test_re_search3(input):
    return not bool(re.search(r'[^A-F0-9]', input))

And the tests, in Python 3.4.0 on Mac OS X:

import cProfile
import pstats
import random

# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]

# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()

# run the test functions against each item in tests. 
# this takes a little time; be patient
for t in tests:
    for tf in [test_set, test_not_any, 
               test_re_match1, test_re_match2, test_re_match3,
               test_re_search1, test_re_search2, test_re_search3]:
        _ = tf(t)

# stop collecting stats
test_pr.disable()

# we create our own pstats.Stats object to filter 
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)

# normally, stats are printed with the format %8.3f, 
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
    return "%11.6f" % x

def _print_title(self):
    print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
    print('filename:lineno(function)', file=self.stream)

pstats.f8 = _f8
pstats.Stats.print_title = _print_title

# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")

which gave the following results:

         50335004 function calls in 13.428 seconds

   Ordered by: cumulative time, function name
   List reduced from 20 to 8 due to restriction 

   ncalls     tottime     percall     cumtime     percall filename:lineno(function)
    10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
    10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
    10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
    10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
    10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
    10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
    10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
    10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)

where:

ncalls: The number of times that function was called
tottime: the total time spent in the given function, excluding time spent in calls to sub-functions
percall: the quotient of tottime divided by ncalls
cumtime: the cumulative time spent in this and all subfunctions
percall: the quotient of cumtime divided by primitive calls

The columns we actually care about are cumtime and percall, as that shows us the actual time taken from function entry to exit. As we can see, regex match and search are not massively different.

If you would otherwise call re.compile on every invocation, it is faster to skip it and call re.match or re.search directly. Compiling once up front is about 7.5% faster than compiling on every call, but only about 2.5% faster than not compiling at all.

test_set was roughly twice as slow as re_search and about 2.3 times as slow as re_match

test_not_any was a full order of magnitude slower than test_set

TL;DR: Use re.match or re.search
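The compile-once versus compile-every-call comparison can also be re-run with `timeit`, which is lighter than the profiler setup used here. This is only a sketch with invented function names; the small gap between the variants is consistent with the `re` module keeping an internal cache of recently compiled patterns, so your numbers may differ.

```python
import re
import timeit

HEX_MATCH = re.compile(r"[A-F0-9]+\Z")


def compiled_once(text):
    # Uses the pattern object created at import time.
    return bool(HEX_MATCH.match(text))


def compiled_each_call(text):
    # re.compile is called every time; the re module's pattern cache
    # makes the repeat calls cheap, but not free.
    return bool(re.compile(r"[A-F0-9]+\Z").match(text))


def module_level(text):
    # re.match compiles (or fetches from the cache) behind the scenes.
    return bool(re.match(r"[A-F0-9]+\Z", text))


if __name__ == "__main__":
    sample = "ABCDEF1234567890" * 10
    for func in (compiled_once, compiled_each_call, module_level):
        seconds = timeit.timeit(lambda: func(sample), number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s")
```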

`hex_match = re.compile(r'^[A-F0-9]+$')` matches "F00BAA\n" ... use `\Z` instead of `$` — John Machin, Jan 29 '17 at 22:24
$ matches *before* the \n: `>>> re.match(r'^[A-F0-9]+$', 'F00BAA\n').group(0)` `<<< 'F00BAA'`. Using `\Z` is only preferable if you explicitly want the match to fail if there's a newline at the end — KingRadical, Jan 30 '17 at 20:10
Read the 2nd line of the OP's question: "and no other character" -- this calls for `\Z` — John Machin, Jan 30 '17 at 21:04

score 3 · Answer 6 · answered Jan 10 '19 at 07:36

Use Python sets when you need to compare sets of data. A string can be converted to a set of characters quite quickly. Here I test whether a string is an allowed phone number: the first string is allowed, the second is not. It is fast and simple.

In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()

Out[17]: 0.8106249139964348

In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 фыв');p.issubset(allowed)").timeit()

Out[18]: 0.9240323599951807

Never use regexps if you can avoid them.
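Note that the timings above rebuild the allowed set on every run. A sketch of the same idea with the set built once and wrapped in a function might look like this (`PHONE_CHARS` and `is_phone_like` are names invented for the example):

```python
PHONE_CHARS = frozenset("0123456789+-() ")


def is_phone_like(text):
    # issuperset() accepts any iterable, so the string can be passed directly.
    return PHONE_CHARS.issuperset(text)


print(is_phone_like("+7(898) 64-901-63 "))     # True
print(is_phone_like("+7(950) 64-901-63 фыв"))  # False
```

This checks only which characters appear, not whether the string is a well-formed phone number.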

score 1 · Answer 7 · answered Sep 22 '22 at 14:35

allowed_characters = 'hsjwnbs#'

def isValidName(string, allowed_chars):
    allowed_chars = set(allowed_chars)
    validation = set(string)
    return validation.issubset(allowed_chars)

Yaver Javid


score 0 · Answer 8 · answered Mar 12 '20 at 15:35

A different approach, because in my case I needed to also check whether it contained certain words (like 'test' in this example), not characters alone:

input_string = 'abc test'
input_string_test = input_string
allowed_list = ['a', 'b', 'c', 'test', ' ']

for allowed_list_item in allowed_list:
    input_string_test = input_string_test.replace(allowed_list_item, '')

if not input_string_test:
    # test passed
    print('test passed')

So, the allowed strings (characters or words) are cut from the input string. If the input string contained only allowed strings, an empty string is left over, and the check `if not input_string_test` passes.

This goes through the whole text for every allowed string, making it O(n*k) time. If you're dealing with big texts, you should change it so it loops over its characters only once, making it O(n) — Bob Bobster, Jun 30 '21 at 13:17
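One possible way to avoid the repeated `replace` passes is to build a single regex alternation from the allowed strings and test the whole input with `re.fullmatch`. This is a sketch rather than a drop-in replacement: `build_token_checker` is a hypothetical helper, and it assumes a non-empty list of non-empty allowed strings. It also avoids a pitfall of sequential replacement, where removing one item can join the leftovers into another allowed item (with allowed items 'c' and 'ab', the input 'acb' would be reduced to an empty string and pass).

```python
import re


def build_token_checker(allowed_tokens):
    # Longest tokens first, so the alternation tries 'test' before any
    # shorter token that starts with the same characters.
    ordered = sorted(allowed_tokens, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' literal.
    pattern = re.compile("(?:" + "|".join(map(re.escape, ordered)) + ")*")

    def is_allowed(text):
        # The text passes only if it is entirely a sequence of allowed tokens.
        return pattern.fullmatch(text) is not None

    return is_allowed


check = build_token_checker(['a', 'b', 'c', 'test', ' '])
print(check('abc test'))  # True
print(check('abc tes'))   # False
```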
