What is the fastest way to decide if a digit appears in a string?

Question

This simple solution popped into my mind quickly.

#include <ctype.h>

int digit_exists_in
(
    const char *s
)
{
    while (*s)
    {
        if (isdigit(*s))
        {
            return 1;
        }
        else
        {
            s++;
        }
    }

    return 0;
}

int main(void)
{
    int foundDigit = digit_exists_in("abcdefg9ijklmn");

    return 0;
}

What other techniques could be applied to get better speed?

The actual strings being searched are variable length, and the characters themselves are ASCII, not the full character set. The strings are NUL terminated.

NickZoic · Accepted Answer · 2009-04-24T00:54:19.650

liw.fi is right, by the way. I was a bit surprised by this, since strcspn has to do a much more general problem than the isdigit() approach, but it seems to be the case:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

#define NTESTS 10000
#define TESTSIZE 10000

char stest1[TESTSIZE];
char stest2[TESTSIZE];

int test_isdigit(char *s) {
    while (*s) {
        if (isdigit(*s)) return 1;
        s++;
    }
    return 0;
}

int test_range(char *s) {
    while (*s) {
        if ((*s >= '0') && (*s <= '9')) return 1;
        s++;
    }
    return 0;
}

int test_strcspn(char *s) {
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(int argc, char **argv) {
    long int i;
    for (i=0; i<TESTSIZE; i++) {
        stest1[i] = stest2[i] = 'A' + i % 26;
    }
    stest2[TESTSIZE-1] = '5';

    int alg = atoi(argv[1]);

    switch (alg) {
        case 0:        
            printf("Testing strcspn\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_strcspn(stest1) == 0);
                assert(test_strcspn(stest2) != 0);
            }
            break;
        case 1:
            printf("Testing isdigit() loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_isdigit(stest1) == 0);
                assert(test_isdigit(stest2) != 0);
            }
            break;
        case 2:
            printf("Testing <= => loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_range(stest1) == 0);
                assert(test_range(stest2) != 0);
            }
            break;
        default:
            printf("eh?\n");
            exit(1);
    }    

    return 0;
}

It's awfully hard to beat the standard libraries at their own game ... (the usual caveats apply -- YMMV)

$ gcc -O6 -Wall -o strcspn strcspn.c 

$ time ./strcspn 0
Testing strcspn

real    0m0.085s
user    0m0.090s
sys 0m0.000s

$ time ./strcspn 1
Testing isdigit() loop

real    0m0.753s
user    0m0.750s
sys 0m0.000s

$ time ./strcspn 2
Testing <= => loop

real    0m0.247s
user    0m0.250s
sys 0m0.000s

UPDATE: Just for fun, I added a bitmap lookup version based on Mike Dunlavey's answer:

char bitmap[256] = {
        /* 0x00 */ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x10 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x20 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x30 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
};

int test_bitmap(char *s) {
    while (!bitmap[*(unsigned char *)s]) s++;
    return (*s);
}

Which slightly outperforms the others (~.170s) but still can't touch strcspn!

I get similar results, FWIW. test_range() and test_digit() execute almost identically, with strcspn being the clear 'winner'. — Nick Presta, Apr 16 '09 at 02:35
Here on OS X (Leopard), test_digit() performs pretty awfully (2.281 seconds real time), whereas test_range() performs alright (0.686 seconds) and test_strcspn() is still the clear winner (0.275 seconds). The system time for all of them is 0.003 to 0.006, but I notice the lag time for test_isdigit(). — Chris Lutz, Apr 16 '09 at 06:26
Ya. Using numbers and facts to back up your argument is cheating +1 — EvilTeach, Apr 16 '09 at 16:19
Very well done, actual benchmarking is always useful. I doubt there's a huge variation between compilers/architectures in this, but just in case: anyone wanting to use this code should run the benchmark themselves. — , Apr 17 '09 at 07:15
Some insight into why strcpsn is faster: Apparently it uses a lookup table (i.e. a boolean value for all 256 possible characters) instead of >= and <=. (Maybe they have some low-level assembly hacks too :P) — v3., Apr 23 '09 at 22:02
glibc `strcspn` on x86-64 just does the same lookup-table, with the loop unrolled by 4. (And false dependencies between the loads, costing extra ALU uops...) https://code.woboq.org/userspace/glibc/sysdeps/x86_64/strcspn.S.html. Apparently POWER8 has SIMD instructions that can usefully do stuff with a 256-bit bitmap, allowing faster scanning after building one even for the general case of `strcspn`. — Peter Cordes, Jun 04 '21 at 09:22
But for this specific problem, you can do *much* better with x86-specific intrinsics to check 16 bytes at a time, using either SIMD packed `c > ('0'-1)` and-not `(c > '9'), or maybe a `c - '0' <= (unsigned)'9'` range-check: [Convert a String In C++ To Upper Case](https://stackoverflow.com/a/37151084) shows how that can be done efficiently by range-shifting unsigned->signed along with the subtract. (SIMD for the `c-'a' <= (unsigned)25` ascii_islower check is exactly the same problem with different constants: `_mm_sub_epi8` and `_mm_cmpgt_epi8` -> `_mm_movemask_epi8(v) != 0`) — Peter Cordes, Jun 04 '21 at 09:29
https://www.reddit.com/r/simd/comments/myqv32/i_simply_implemented_and_practice_custom_string/ has a SIMD strcspn that's good for *large* inputs. (Probably worse for small inputs.) Also semi-related comments on code-review.SE on [Remove all unwanted characters from a string buffer, in-place](https://codereview.stackexchange.com/posts/comments/518290) — Peter Cordes, Jun 04 '21 at 09:42

score 13 · Answer 2 · answered Apr 15 '09 at 19:33

13

I would start by using the appropriate library function, strcspn, on the assumption that the library has been optimized with extreme prejudice:

#include <string.h>
#include <stdio.h>

int digit_exists_in(const char *s)
{
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(void)
{
    printf("%d\n", digit_exists_in("foobar"));
    printf("%d\n", digit_exists_in("foobar1"));
    return 0;
}

If the library hasn't been optimized sufficiently, it'd be a good idea to put the optimization into the library so everyone can benefit. (You have the source, right?)

answered Apr 15 '09 at 19:33

Interesting. I rejected strspn as the doc seemed to suggest that if the first character was not in the search set then the function would return 0. – EvilTeach Apr 16 '09 at 16:11
strspn return an integer value specifying the length of the substring in string that consists entirely of characters in strCharSet. If string begins with a character not in strCharSet, the function returns 0. No return value is reserved to indicate an error. – EvilTeach Apr 16 '09 at 16:14
I will revisit strspn and strcspn. +1 – EvilTeach Apr 16 '09 at 16:15

score 10 · Answer 3 · answered Apr 15 '09 at 18:43

10

There isn't a faster algorithm, but you can look at a full register's worth of bytes per instruction or use SIMD operations to speed things up. You can either use a mask and zero test to see if it is even possible if there are any digits in a range, or if your SIMD operations are fast enough on large enough vectors, you could do iterate through tests for specific numerical values in a vector of bytes faster than doing character compares.

So, for example, you could do something like:

byte buffer[8] = { 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30 };
uint64 *mask = (uint64 *) buffer; //this is just for clarity

if (*((uint64 *) s) & *mask) == 0)
    //You now don't need to do the < '0' test for the next 8 bytes in s

Some optimizers might be smart enough to do this for you just from your code sample above.

You had better be comparing a TON of bytes to be thinking about optimizations at this level though.

answered Apr 15 '09 at 18:43

Christopher Smith

5,372
1
34
18

This tells you if there are no chars 0..., P..., p... What do you do then? – Mike Dunlavey Apr 15 '09 at 19:49
1

Surely it tells you that there are no numbers in the current 8 numbers you're looking at, so you skip to the next 8 numbers. – justinhj Apr 15 '09 at 20:24
Oh, OK. Zero means there are no numbers in that set (and no letters >='P' or 'p'), but non-zero does not mean there are numbers. So this saves significant time if letters >= 'P' or 'p' are rarely used. – Mike Dunlavey Apr 15 '09 at 23:03
All digits have 30 in the left nibble. if you and them and compare it to the mask and they are equal, then it is a digit. – EvilTeach Apr 16 '09 at 19:29
@Evil: Understand, but that's not what this code does. It only tells you if the string is like "AbCdEfGh", i.e. only letters and only letters that come before P or p. Or am I missing something? – Mike Dunlavey Apr 16 '09 at 22:01
the digit '0' is 48 in decimal, 0x30 in hex. '1' is 0x31, '2' is 0x32... so if you take 0x32 & 0x30 that yields a 0x30 then compare it to 0x30 if they are equal, then it is a digit. – EvilTeach Apr 18 '09 at 01:20
Right, and if you could do 16 bytes at a time, then it'd be faster to have a specific mask for detecting each number, and do it 10 times, which would be faster than comparing each byte individually. You could also just have a second mask that sets an upper bound on the range, so at that point you'd already have a very high probability of finding a number when you looked. – Christopher Smith Apr 23 '09 at 03:36
You need `memcpy` to make that `uint64_t*` load safe wrt. alignment and strict-aliasing. (On ISAs where unaligned loads are legal, or also if the pointer is known to be aligned, it will inline to a single load instruction). Or with GNU C, `typedef unsigned long __attribute__((may_alias)) aliasing_ulong;` works (for aliasing; add `aligned(1)` to the attributes if you need that, too). See [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) – Peter Cordes Jun 04 '21 at 09:19
On CPUs with proper SIMD instructions, you can go much faster, e.g. with Intel intrinsics for a vectorized `c - '0' <= (unsigned)9` range check. [Convert a String In C++ To Upper Case](https://stackoverflow.com/a/37151084) includes an equivalent range-check for ASCII digits, using x86's signed-greater-than by flipping unsigned to signed. ( `v = _mm_sub_epi8(v, _mm_set1_epi8(128 + '0'))`. On x86 (where you can efficiently branch on a SIMD compare result bitmap), you can early-out without scanning the whole string. – Peter Cordes Jun 04 '21 at 09:47
Beware alignment with this: see [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739) - yes, but you need to make sure your 8-byte load doesn't start within 8 bytes of the end of a page. – Peter Cordes Jun 04 '21 at 09:49

Brian R. Bondy · Answer 4 · 2009-04-15T19:07:09.677

7

Any algorithm will be O(N).

I assume isdigit is pretty efficient already.

edited Apr 15 '09 at 19:07

answered Apr 15 '09 at 18:23

Brian R. Bondy

339,232
124
596
636

1

At *least* O(N) :) It's easy to implement pessimizations and "smart" algorithms :) – Mihai Limbășan Apr 15 '09 at 18:25
Good point, you could have some less efficient algorithms, but they'd probably be much harder to implement. – Brian R. Bondy Apr 15 '09 at 18:28
In order to take advantage of short circuits, couldn't you write "!(char < '0') && !(char > '9')" – Kibbee Apr 15 '09 at 18:37
isdigit() is usually a table lookup, i.e. faster than two comparisons. And locale-safe to boot. ;-) – DevSolar Apr 15 '09 at 18:38
isdigit is typically a bitmask operation - and quicker than the two-part test. – Jonathan Leffler Apr 15 '09 at 18:41
+1 in that form - it is usually hard to beat the standard library in efficiency. – DevSolar Apr 16 '09 at 09:06

score 2 · Answer 5 · answered Apr 15 '09 at 18:24

2

Conceptually there is no faster way. This is assuming that you have a string which where the placement of digits is seemingly random. This forces you to search every item in the string for a digit hence front to back is as likely to find the digit first as any other search mechanism.

answered Apr 15 '09 at 18:24

JaredPar

733,204
149
1,241
1,454

ya, there is no pattern to the position of digits which would lend itself to an educated, or cache based guess. – EvilTeach Apr 15 '09 at 18:40

score 2 · Answer 6 · answered Apr 15 '09 at 18:35

2

You could go multithreaded, although that's probably adding too much complexity to the algorithm for something that's already pretty fast.

answered Apr 15 '09 at 18:35

Kibbee

65,369
27
142
182

Good point, but for all but the longest strings, this is going to be a pessimization due to the overhead of threads. – Apr 15 '09 at 18:41
You could also hurt yourself with threads due to horrible caching effects. For stuff like this one would hope the real limiter would be memory latency/bandwidth, and using multiple cores could potentially make it worse. – Christopher Smith Apr 15 '09 at 18:44
If the strings are big enough, sounds like a winner. – Milhous Apr 16 '09 at 03:02

score 2 · Answer 7 · answered Apr 15 '09 at 18:45

2

The real question is how important is it for this function to be optimal? I say leave the simple solution and profile it. You should only optimize it if it is causing a bottleneck in your code.

answered Apr 15 '09 at 18:45

Patrick McDonald

64,141
14
108
120

Mike Dunlavey · Answer 8 · 2009-04-17T21:12:17.733

Just for fun, maybe something allong the lines of:

 // accumulator
unsigned char thereIsADigit = 0;
// lookup table
unsigned char IsDigit[256] = {0,0,0 ..., 1,1,1,1,1,1,1,1,1,0,0,0 ...};

// an unrolled loop, something like:
thereIsADigit |= IsDigit[s[0]];
thereIsADigit |= IsDigit[s[1]];
thereIsADigit |= IsDigit[s[2]];
thereIsADigit |= IsDigit[s[3]];
thereIsADigit |= IsDigit[s[4]];
thereIsADigit |= IsDigit[s[5]];
thereIsADigit |= IsDigit[s[6]];
thereIsADigit |= IsDigit[s[7]];
if (thereIsADigit) break;
s += 8;

On the IBM 360 there was a "translate" instruction that could do this in one step.

OK, OK, Christopher Smith's answer got me thinking. Suppose you are only using 7-bit ASCII. Here's a way to do SIMD with wide-integer arithmetic.

Suppose C is a 32-bit word containing 4 characters.

 // compare by subtracting in 8-bit 2s complement arithmetic
( (C + ((0x3a3a3a3a ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char <= '9'
  &
  (0x2f2f2f2f + ((C ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char >= '0'
) // high bit is set for any char <= '9' and >= '0'
& 0x80808080 // look only at the high-order bits
// if any of these 4 bits is 1, there is a digit in C
// if the result is zero, there are no digits in C

It depends on the high-order bit of each character being initially zero, so that a carry into that bit will not propogate. (I'm sure this could be simplified.)

Ya, the input set is 7 bit ascii. – EvilTeach Mar 17 '10 at 20:56 — EvilTeach, Mar 17 '10 at 20:56

score 2 · Answer 9 · answered Apr 18 '09 at 04:03

Taking the test program and throwing my profiler at it yields the following.

      Count      %   % with           Time   Statement
                      child
-----------------------------------------------------------------------------------------------
                                             int test_isdigit(char *s)   
     20,000    0.0    0.0          2,160.4   {   
199,990,000   13.2    9.5     14,278,460.7       while (*s)  
                                                 {   
199,980,000   69.8   49.9     75,243,524.7            if (isdigit(*s)) return 1;  
199,970,000   17.0   12.1     18,312,971.5            s++;    
                                                 }   

     10,000    0.0    0.0          1,151.4       return 0;   
                                             }   

                                             int test_range(char *s)     
     20,000    0.0    0.0          1,734.2   {   
199,990,000   33.6    9.4     14,114,309.7       while (*s)  
                                                 {
199,980,000   32.2    9.0     13,534,938.6           if ((*s >= '0') && 
                                                        (*s <= '9')) return 1;   
199,970,000   34.2    9.5     14,367,161.9           s++;    
                                                 }   

     10,000    0.0    0.0          1,122.2       return 0;   
                                             }   

                                             int test_strcspn(char *s)   
     20,000    0.2    0.0          1,816.6   {   
     20,000   99.8    0.6        863,113.2       return s[strcspn(s, "0123456789")] 
                                                          == '0'; 
                                             }

strcspn does the job well enough. Looking at the asm code for it, I see it builds a bitmap of size 256 sets the bits based on the search characters, then processes the string.

The bit map gets built on the stack once for every call.

One other approach would be to build and keep the bit map, and reused it each time.

Another approach would be to do the operations in parallel using the techniques that Chris Smith talked about.

For the moment strcspn will suffice.

Tim · Answer 10 · 2009-04-15T19:39:52.233

1

if you really want to reduce the overhead time and you don't mind making it specific for char, then you can check the ascii values between 0 and 9 inclusive.

48 to 57 decimal

this removes a stack call.

I should have said lookup table as well...

edited Apr 15 '09 at 19:39

answered Apr 15 '09 at 18:26

Tim

20,184
24
117
214

That's exactly what isdigit does, and you can just make your own inline isdigit implementation. but yeah. – Apr 15 '09 at 18:30
Scratch that, the comments to Bondy's answer refute me. – Apr 15 '09 at 18:42

score 1 · Answer 11 · 2009-04-15T18:40:28.680

As others have said, you can't get below O(N).

I can think of a contrived scenario with logarithmic speed... Say you're writing a text editor and it needs a "does this file contain any digits" feature. You can keep a sorted array of all unique characters present in the file, update it on every keystroke, and query it with binary search. This probably lies outside the scope of your question though (and can be done in several better ways).

dwc · Answer 12 · 2009-04-16T04:10:59.307

1

Here's a version that may or may not be faster, but it handles a NULL pointer...

int digit_exists_in(const char *s)
{
    if (!s)
        return (0);
    while (*s)
        if (isdigit(*s++))
            return (1);
    return (0);
}

edited Apr 16 '09 at 04:10

answered Apr 16 '09 at 01:43

dwc

24,196
7
44
55

score 1 · Answer 13 · answered Apr 16 '09 at 02:28

Memory Prefetch

If your strings are very long, either get your compiler to do it or manually unroll your loop and drop in a memory prefetch instruction or two every cache-line.

That way while the CPU is scanning, the memory controller can be pulling in the next lines of data.

If you save the length of the string when you create it you can skip all the checks for the NUL byte, which means you can unroll your loop to operate in bigger chunks and reduce the number of compare and branch operations, although with current branch predictors, it honestly doesn't make much difference.

Even with great CPU branch predictors, the loop will slow down if it has to check the loop counter every time through the loop to decide when to do a memory prefetch so unrolling is still helpful in that case.

Profiling Feedback

For best performance the CPU does need to get the branches hinted correctly, and that's where profiling feedback comes in very handy. Otherwise the compiler is just making a somewhat educated guess.

score 0 · Answer 14 · answered Apr 15 '09 at 19:48

0

Have a human look at it. Humans can do this in O(1) time, we have MUCH larger word sizes than even modern processors.

That said, actual time would still be better with your method...what with the difference in cycle time between a modern core and a human brain.

answered Apr 15 '09 at 19:48

Jeff

2,835
3
41
69

1

Not to mention the caloric load (twinkies/invocation :) – Mike Dunlavey Apr 15 '09 at 23:10

score 0 · Answer 15 · answered Apr 15 '09 at 19:59

0

I may be wrong, but there may be a faster way.

Quicksort the string.

The serial search has a best time of O(1), a mean of O(1/2n) and worse case of O(n).

Quicksort has best O(log n), mean O(nlog n), worse case O(n^2).

The thing is, you can bail out of the quicksort as soon as you see a digit. If the quicksort actually completes, the digit will be at the beginning of the sorted string, so you'll find it then in O(1).

The achievement here it to shift the best, mean and worse case behaviours. A quicksort will have worse worst case behaviour, but better mean behaviour.

answered Apr 15 '09 at 19:59

Good thinking. I prefer merge sort, which is always NlogN. Of course, that's only logN times slower than N :) – Mike Dunlavey Apr 15 '09 at 23:07
Would it be quicker to use XML ? ;-) – Martin Beckett Apr 17 '09 at 20:56
If you're going to BS, do it properly. Anyone can see 1/2 n is better than n log n. – Matthew Flaschen Apr 18 '09 at 05:27
@Matthew - O(nlog n) is the mean for a *completed* quicksort. However, in this method, the quicksort usually does not need to complete. – Apr 18 '09 at 07:02

score 0 · Answer 16 · answered Apr 16 '09 at 00:55

0

Of course, you can sacrifice accuracy for speed:

int digit_exists_in(const char* s)
{
    return 0;
}

This algorithm has a complexity of O(1) and an approximacity of O((246/256)^N).

answered Apr 16 '09 at 00:55

aib

45,516
10
73
79

Agreed on the shield! The algorithm can be improved to approximacity of O((246/256)^(N-1)) with no increase in complexity (in big-O) by replacing the "return 0" with "return isdigit(*s)" – DocMax Apr 17 '09 at 21:34
Thanks, it was so I couldn't gain any reputation from a joke. (Though now I think it works both ways.) – aib Apr 19 '09 at 00:17

What is the fastest way to decide if a digit appears in a string?

16 Answers16

Memory Prefetch

Profiling Feedback