If the string really contains Unicode (UTF-8), the problem is decidedly
non-trivial; you'll probably want to use some external library, like
ICU. Or you can convert to wchar_t (wstring), and use the single-byte
encoding solution described below.
If the characters are single-byte encoded, std::find_if with a
suitable predicate should do the trick. If you're doing any text
parsing, you'll want to define a set of such predicates, once and for
all; the predicates can use the functions in the std::ctype facet of
std::locale, or the ones in wctype.h (which use the global C locale).
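
As a rough illustration of the single-byte case, here is a minimal sketch of such a predicate built on the std::ctype facet and passed to std::find_if. The names IsPunct and findPunct are made up for this example, and it assumes a narrow-character std::string in a single-byte encoding.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical predicate: true for characters that the given locale
// classifies as punctuation. Assumes a single-byte encoding.
class IsPunct {
    std::locale loc_;                   // holding the locale keeps the facet alive
    const std::ctype<char>* ctype_;     // facet owned by loc_
public:
    explicit IsPunct(const std::locale& loc = std::locale())
        : loc_(loc), ctype_(&std::use_facet<std::ctype<char>>(loc_)) {}

    bool operator()(char ch) const {
        return ctype_->is(std::ctype_base::punct, ch);
    }
};

// Hypothetical helper: iterator to the first punctuation character,
// or s.end() if there is none.
std::string::const_iterator findPunct(const std::string& s,
                                      const std::locale& loc = std::locale()) {
    return std::find_if(s.begin(), s.end(), IsPunct(loc));
}
```

Storing a copy of the locale in the predicate is a deliberate choice: the facet pointer stays valid for as long as some locale object refers to it, so the predicate can be copied freely by the algorithm.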
Still, if you are dealing with Unicode, even converting to wide
characters may not be enough, since full Unicode can still use more than
one code point to represent a single character. The real question is
just how seriously you want to do this. (Note too that in many languages,
like English or French, "words" can contain characters which Unicode
considers punctuation, e.g. "don't" or "aujourd'hui": the Unicode
tables will tell you that '\'' is punctuation, not part of a word.)
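
One possible way to cope with the apostrophe issue in the single-byte case is a predicate that explicitly treats '\'' as a word character. The sketch below does that; IsWordChar and splitWords are hypothetical names, and only ASCII letters (classic "C" locale) are considered, which is an assumption made to keep the example short.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Hypothetical predicate: letters plus the apostrophe count as word
// characters. Only the classic "C" locale is consulted here.
struct IsWordChar {
    bool operator()(char ch) const {
        return ch == '\'' || std::isalpha(ch, std::locale::classic());
    }
};

// Hypothetical helper: splits text into maximal runs of word characters.
std::vector<std::string> splitWords(const std::string& text) {
    std::vector<std::string> words;
    auto it = text.begin();
    while (true) {
        auto first = std::find_if(it, text.end(), IsWordChar());
        if (first == text.end()) {
            break;                      // no more words
        }
        auto last = std::find_if_not(first, text.end(), IsWordChar());
        words.emplace_back(first, last);
        it = last;
    }
    return words;
}
```

This is deliberately naive: an apostrophe used as a quotation mark around a phrase would be glued onto the adjacent word, so a more serious solution would need to check that the apostrophe has letters on both sides.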