4

I was looking for a trivial way to check if a type std::string contains only alphanumeric characters. My code looks like the following:

   std::string sStr("This is a test");
   for (std::string::const_iterator s = sStr.begin(); s != sStr.end(); ++s)
              if (! isalnum(*s)) return false;

Is this the right way of doing this? Is there a more efficient way of handling the problem?

randombits
  • 47,058
  • 76
  • 251
  • 433
  • 1
    Fundamentally, every character in the string must be checked if it is in the range of an alphanumeric character. You may be able to improve efficiency by using some looping constructs in ``, but I don't think they would be significant. – Thomas Matthews May 02 '16 at 18:19
  • 1
    You forgot to state if you want to use different locales or plain ASCII? – BJovke May 03 '16 at 10:20

3 Answers3

6

I would use std::find_if:

std::string sStr("This is a test");

auto it = std::find_if(std::begin(sStr), std::end(sStr), [](unsigned char c) {
    return !std::isalnum(c);
});

return it == std::end(sStr);

std::find_if_not maybe would make more sense:

auto it = std::find_if_not(std::begin(sStr), std::end(sStr), [](unsigned char c) {
    return std::isalnum(c);
});
Rakete1111
  • 47,013
  • 16
  • 123
  • 162
  • 1
    Maybe more elegant, but I doubt it's more efficient. – Frank Puffer May 02 '16 at 18:07
  • just curious, what is the last argument `[](char c)` doing here, not totally understanding the syntax. thanks for posting a better solution, though. – randombits May 02 '16 at 18:07
  • @randombits that's a lambda, it's like a function, but anonymous, you don't have to declare it separately. You can read more here: http://stackoverflow.com/questions/7627098/what-is-a-lambda-expression-in-c11 – Rakete1111 May 02 '16 at 18:08
  • @FrankPuffer I think there isn't a better way (correct me if I am wrong), but you have to loop over every character, which is the worst case for `std::find_if`, because it will terminate early if it has found a "matching" character. – Rakete1111 May 02 '16 at 18:11
  • so in this case, `it` is equal to true if it runs into a character that is NOT alphanumeric? – randombits May 02 '16 at 18:12
  • @randombits `std::find_if` searches for a character that is NOT alphanumeric. If it has found one, it will return an iterator to the not alphanumeric character. It it didn't find one, it will return the end iterator, so I'm comparing it to `it`, so it will return true if there are only alphanumeric characters – Rakete1111 May 02 '16 at 18:14
  • 3
    `std::any_of` or `std::all_of` would make more sense. – juanchopanza May 02 '16 at 18:22
  • As per docs: https://en.cppreference.com/w/cpp/string/byte/isalnum "To use these functions safely with plain chars (or signed chars), the argument should first be converted to unsigned char". – Mecanik Jan 14 '19 at 14:50
  • 1
    @Rakete1111 No worries, also made 2 examples using your code for those who don't know how to use this: https://ideone.com/WiERqn – Mecanik Jan 14 '19 at 14:56
6

Yes, actually*. The looping constructs found in <algorithm> appear to perform slightly better than a raw loop under optimization, at least with gcc.

Using <algorithm> and a lambda is a nice way of doing this, at first glance:

bool onlyAlnum(const std::string& str){
    return std::all_of(
        str.begin(), 
        str.end(), 
        [loc = std::locale{}](char c){return std::isalnum(c, loc);});
}

However this has its disadvantages.

locale: the locale version of isalnum seemed to be quite a bit slower than the <cctype> incarnation of the function when I went to test it. the cctype version doesn't pay attention to locale, but testing single char for whether they're alphanumeric is going to work for a tiny, tiny subset of UTF-8 characters anyway: UTF-8 is a variable-width encoding, and testing a part of a multi-char character will result in a false negative for a test of alphanumerism.

The lambda above is a c++14 lambda, which initializes a variable loc to hold the locale when it's first created. This allows the function to work dependent on the present locale, while also preventing the cost of constructing a new object to represent the locale every time the predicate is evaluated, as would happen with a lambda like:

[](char c){return std::isalnum(c, std::locale{});}

However, it's still a very slow test, in comparison. If we don't need the limited advantages of the <locale> incarnation of std::isalnum, we can use the (much) faster <cctype> version:

[](char c){return std::isalnum(c);}

So we come to a second way of doing this, which uses the cctype version instead. Testing shows this to be much faster, on par with the raw loop you gave:

bool onlyAlnumAllOf(const std::string& str){
    return std::all_of(
        str.begin(), 
        str.end(), 
        [](char c){return std::isalnum(c);});
}

all_of tests whether a condition is valid for every entry in a range of input iterators. The range is supplied by the first two arguments, here str.begin() and str.end(), which naturally define the start and end of the string.

and a demo on coliru shows that onlyAlNum will return true for any string containing only characters from the alphabet or numbers, though not anything containing a space.

In the end, you can test the difference. With a cursory test evaluating "oyn3478qo47nqooina7o8oao7nroOL" 1000000 times, these are the results:

MinGW-64 port of gcc 5.2.0 on my machine

g++ main.cpp -Wall -Wextra -Wpedantic --std=c++14 -o3

all_of (with locale information): 652ms for 1000000 iterations
all_of: 63ms for 1000000 iterations
find_if: 63ms for 1000000 iterations
loop: 70ms for 1000000 iterations
range-loop: 69ms for 1000000 iterations

and coliru with gcc 6.1.0:

g++ main.cpp -Wall -Wextra -Wpedantic --std=c++14 -o3

all_of (with locale information): 1404ms for 1000000 iterations
all_of: 101ms for 1000000 iterations
find_if: 110ms for 1000000 iterations
loop: 108ms for 1000000 iterations
range-loop: 119ms for 1000000 iterations

and clang 3.8.0 on coliru:

clang++ -std=c++14 -O3 -Wall -Wextra -Wpedantic main.cpp

all_of (with locale information): 1127ms for 1000000 iterations
all_of: 85ms for 1000000 iterations
find_if: 72ms for 1000000 iterations
loop: 128ms for 1000000 iterations
range-loop: 88ms for 1000000 iterations

As you can see, it varies by compiler and version which function is the fastest. Isn't optimization fun?

this is the function I used to test each method:

using StrEvaluator = bool (*)(const std::string&);
using Milliseconds = std::chrono::milliseconds;

void testStrEvaluator(StrEvaluator eval, std::string str){
    auto begin = std::chrono::steady_clock::now();
    bool result = true;
    for(unsigned int i = 0; i < 1000000; ++i){
        str.resize(str.size());
        result &= eval(str);
    }
    auto end = std::chrono::steady_clock::now();
    std::cout
        << std::chrono::duration_cast<Milliseconds>(end - begin).count()
        << "ms for 1000000 iterations\n";
}

There are flaws with the testing: coliru doesn't provide guarantees regarding consistent resources during execution, and I didn't shut down other programs on my computer, so the variations could be a fluke. However, they seem consistent enough to draw a few conclusions from: both the looping constructs from algorithm and raw loops can perform quite well, and choosing between them based on speed (unless you've discovered the loop is a bottleneck) is more micro-optimization than anything else.

jaggedSpire
  • 4,423
  • 2
  • 26
  • 52
1

You can accomplish this in a couple of superficially different ways, but the efficient solution remains essentially what you have already written. Any solution will necessarily have to check characters one by one, until either all have been found to be alphanumeric, or a non-alphanumeric has been found.

Actually, that's not quite true. If your strings are very long, you may benefit from parallelism. But I suspect that this is not the case.

As far as style goes, I would suggest using range for (if you have C++11); otherwise, I'd write what you have.

alcedine
  • 909
  • 5
  • 18