This has already been answered satisfactorily, but for people coming across this after the fact, I have done some profiling of several different methods of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.
Here are my test implementations:
import re
hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')
def test_set(input):
    # every character must be a member of the allowed set
    return set(input) <= hex_digits
def test_not_any(input):
    return not any(c not in hex_digits for c in input)
def test_re_match1(input):
    # compiles the pattern on every call
    return bool(re.compile(r'^[A-F0-9]+\Z').match(input))
def test_re_match2(input):
    # uses the pattern compiled once at module level
    return bool(hex_match.match(input))
def test_re_match3(input):
    # passes the pattern string straight to re.match
    return bool(re.match(r'^[A-F0-9]+\Z', input))
def test_re_search1(input):
    return not bool(re.compile(r'[^A-F0-9]').search(input))
def test_re_search2(input):
    return not bool(hex_search.search(input))
def test_re_search3(input):
    # must be re.search: re.match would only look at the first character
    return not bool(re.search(r'[^A-F0-9]', input))
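One thing worth checking before comparing speed is whether the approaches agree on edge cases. The following is a minimal sketch with shortened, hypothetical names (`by_set`, `by_not_any`, `by_match`, `by_search`, and `compare` are mine, not part of the profiled code). It shows that the `+` in the match pattern rejects the empty string, while the other three approaches accept it.

```python
import re

# Hypothetical stand-ins for the four approaches above.
HEX_DIGITS = set("ABCDEF1234567890")
HEX_MATCH = re.compile(r'^[A-F0-9]+\Z')
HEX_SEARCH = re.compile(r'[^A-F0-9]')

def by_set(s):
    return set(s) <= HEX_DIGITS

def by_not_any(s):
    return not any(c not in HEX_DIGITS for c in s)

def by_match(s):
    return bool(HEX_MATCH.match(s))

def by_search(s):
    return not HEX_SEARCH.search(s)

def compare(s):
    """Return each checker's verdict for s, keyed by checker name."""
    return {f.__name__: f(s) for f in (by_set, by_not_any, by_match, by_search)}

# Valid, lowercase, invalid-character, and empty inputs.
for sample in ("DEADBEEF", "deadbeef", "12G4", ""):
    print(repr(sample), compare(sample))
```

If empty input should count as valid for the match approach, changing `+` to `*` in the pattern would make all four agree.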
And the tests, in Python 3.4.0 on Mac OS X:
import cProfile
import pstats
import random
# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]
# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()
# run the test functions against each item in tests.
# this takes a little time; be patient
for t in tests:
    for tf in [test_set, test_not_any,
               test_re_match1, test_re_match2, test_re_match3,
               test_re_search1, test_re_search2, test_re_search3]:
        _ = tf(t)
# stop collecting stats
test_pr.disable()
# we create our own pstats.Stats object to filter
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)
# normally, stats are printed with the format %8.3f,
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
    return "%11.6f" % x
def _print_title(self):
    # header widths follow the data columns: ncalls is 9 wide, each time column 11 wide
    print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
    print('filename:lineno(function)', file=self.stream)
pstats.f8 = _f8
pstats.Stats.print_title = _print_title
# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")
which gave the following results:
50335004 function calls in 13.428 seconds
Ordered by: cumulative time, function name
List reduced from 20 to 8 due to restriction
   ncalls     tottime     percall     cumtime     percall filename:lineno(function)
    10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
    10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
    10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
    10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
    10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
    10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
    10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
    10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)
where:
- ncalls: the number of times the function was called
- tottime: the total time spent in the function itself, excluding time spent in calls to sub-functions
- percall (first): tottime divided by ncalls
- cumtime: the cumulative time spent in this function and all sub-functions
- percall (second): cumtime divided by the number of primitive calls
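To make the difference between tottime and cumtime concrete, here is a small sketch that profiles a toy call chain and returns the report as text. The functions `inner`, `outer`, and `profile_report` are hypothetical names used only for illustration; the `cProfile` and `pstats` calls are the same ones used in the test script.

```python
import cProfile
import io
import pstats

def inner(n):
    return sum(range(n))

def outer(n):
    # Time spent inside inner() counts toward outer's cumtime but not its tottime.
    return [inner(n) for _ in range(100)]

def profile_report(func, *args):
    """Profile one call to func and return the pstats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    # Writing to a StringIO keeps the report out of stdout until we want it.
    pstats.Stats(profiler, stream=stream).sort_stats('cumtime').print_stats()
    return stream.getvalue()

print(profile_report(outer, 10000))
```

In the printed report, `outer` should show a small tottime and a much larger cumtime, because almost all of its work happens in `inner`.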
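The columns we actually care about are cumtime and the second percall, as they show the actual time taken from function entry to exit. As we can see, regex match and search are not massively different: the search variants took roughly 20 to 30% longer than the corresponding match variants, which is small next to the gap to the non-regex approaches.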
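If you would otherwise compile the regex on every call, it is faster not to compile explicitly at all: test_re_match3 (passing the pattern string to re.match) beat test_re_match1 (calling re.compile on every call) by about 4%. Compiling once up front (test_re_match2) was fastest, about 7% faster than compiling every time but only about 3% faster than not compiling explicitly. The small gap is consistent with the re module caching recently compiled patterns.
test_set took roughly 1.8 times as long as the re search variants and 2.2 to 2.3 times as long as the re match variants.
test_not_any was a full order of magnitude slower than test_set (about 11.5 times as long).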
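One caveat on the test_not_any figure: cProfile adds overhead to every function call it records, and the generator expression inside any() is resumed once per character, which is consistent with the roughly 50 million calls reported above. The sketch below cross-checks the ranking with timeit, which does not instrument individual calls. The `check_*` names, `time_checkers`, the 5000-character sample, and the repeat count are all my own arbitrary choices, not part of the original test, and I make no claim about what numbers it will produce on your machine.

```python
import random
import re
import timeit

# Hypothetical stand-ins for the profiled functions.
HEX_DIGITS = set("ABCDEF1234567890")
HEX_MATCH = re.compile(r'^[A-F0-9]+\Z')
HEX_SEARCH = re.compile(r'[^A-F0-9]')

def check_set(s):
    return set(s) <= HEX_DIGITS

def check_not_any(s):
    return not any(c not in HEX_DIGITS for c in s)

def check_match(s):
    return bool(HEX_MATCH.match(s))

def check_search(s):
    return not HEX_SEARCH.search(s)

def time_checkers(sample, number=200):
    """Return {checker name: seconds taken for `number` calls on sample}."""
    checkers = (check_set, check_not_any, check_match, check_search)
    return {f.__name__: timeit.timeit(lambda f=f: f(sample), number=number)
            for f in checkers}

# One valid 5000-character string; size and repeat count are arbitrary.
sample = ''.join(random.choice("ABCDEF1234567890") for _ in range(5000))
for name, seconds in sorted(time_checkers(sample).items(), key=lambda kv: kv[1]):
    print("%-14s %.6f s" % (name, seconds))
```

If the ordering matches the profile output, the conclusion stands; if the not-any variant closes much of the gap, part of its cost in the table above was profiler overhead.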
TL;DR: Use re.match or re.search