How to find characters that cannot be part of a word in a Unicode string?

Question

I have some text in a string, and I need to check whether it contains any characters that cannot be part of a word.

Suppose I have text like "(hello}"

Here it contains two symbols, '(' and '}'. How can I do this in C++? The string may contain any Unicode character.

what are the characters that are "allowed to make a word"? just letters? numbers? underscore? space? punctuation marks? for just letters see [`isalpha`](http://www.cplusplus.com/reference/clibrary/cctype/isalpha/) or [`iswalpha`](http://pubs.opengroup.org/onlinepubs/007908799/xsh/iswalpha.html). — Vlad, Jun 23 '11 at 14:30
Please [don't add signatures or taglines to your posts](http://www.stackoverflow.com/faq#signatures). — user229044, Jun 23 '11 at 14:34
Either I did not get your question correctly, or you are looking for regular expressions in C++. If that is true, look at this thread: http://stackoverflow.com/questions/181624/c-what-regex-library-should-i-use?answertab=votes#tab-top — Ozair Kafray, Jun 23 '11 at 14:35
boost::regex, libpcre, or just simple strspn/strcspn, strpbrk... — sehe, Jun 23 '11 at 14:36
Can you elaborate further on which characters "are allowed to make a word"? — Mark B, Jun 23 '11 at 14:41
No, I don't want regex. I just need a way to find out whether a given character is valid as part of a word. — Vivek Kumar, Jun 23 '11 at 14:50
"petróleo" is a valid Spanish word. If I use isalpha or iswalpha, it fails here for the "ó" character. — Vivek Kumar, Jun 23 '11 at 14:54
@dearvivekkumar If you use a Spanish locale, then `std::isalpha` should work properly. Are you really trying to see if something is a word *in any language at all*? — Mark B, Jun 23 '11 at 15:05

score 4 · Accepted Answer · answered Jun 23 '11 at 14:56

If the string really contains Unicode (UTF-8), the problem is decidedly non-trivial; you'll probably want to use some external library, like ICU. Or you can convert to wchar_t (wstring), and use the single byte encoding solution below:

If the characters are single-byte encoded, std::find_if with a suitable predicate should do the trick. If you're doing any text parsing, you'll want to define a set of such predicates once and for all; the predicates can use the functions in the std::ctype facet of a locale, or the ones in wctype.h (which use the global locale).
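A minimal sketch of the std::find_if approach, assuming single-byte text and assuming that "word character" simply means "alphabetic in the global C locale". The names `isNotWordChar` and `firstNonWordChar` are made up for this illustration:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical predicate: true for any character we decide cannot be part of a word.
// Assumption: only alphabetic characters (per the global C locale) count as word characters.
bool isNotWordChar(char ch) {
    // Convert to unsigned char first: passing a negative value to std::isalpha is undefined.
    return std::isalpha(static_cast<unsigned char>(ch)) == 0;
}

// Returns the index of the first non-word character, or std::string::npos if there is none.
std::string::size_type firstNonWordChar(const std::string& text) {
    std::string::const_iterator it =
        std::find_if(text.begin(), text.end(), isNotWordChar);
    if (it == text.end()) {
        return std::string::npos;
    }
    return static_cast<std::string::size_type>(it - text.begin());
}
```

With this sketch, `firstNonWordChar("(hello}")` returns 0 (the opening parenthesis) and `firstNonWordChar("hello")` returns `std::string::npos`. Swapping in a different predicate changes the definition of a word without touching the search.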

Still, if you are dealing with Unicode, even converting to wide characters may not be enough, since full Unicode can still use more than one code point to represent a single character. The real question is just how seriously you want to do this. (Note too that in many languages, like English or French, "words" can contain characters which Unicode considers punctuation, e.g. "don't" or "aujourd'hui": the Unicode tables will tell you that '\'' is punctuation, not part of a word.)
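To make the apostrophe point concrete, here is a rough sketch of a word splitter over a std::wstring. It rests on two assumptions that are not from the answer: a letter is whatever std::iswalpha accepts in the current global locale, and an apostrophe counts as part of a word only when it sits between two letters. The function names are hypothetical, and the sketch does not handle combining sequences or the typographic apostrophe (U+2019):

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Assumption: a letter is anything std::iswalpha accepts in the current global locale.
bool isWordLetter(wchar_t ch) {
    return std::iswalpha(static_cast<std::wint_t>(ch)) != 0;
}

// Assumption: an apostrophe belongs to a word only when it has a letter on both sides,
// as in "don't" or "aujourd'hui".
bool isInnerApostrophe(const std::wstring& text, std::size_t i) {
    return text[i] == L'\'' && i > 0 && i + 1 < text.size()
        && isWordLetter(text[i - 1]) && isWordLetter(text[i + 1]);
}

// Splits text into words; every character that is neither a letter nor an inner
// apostrophe acts as a separator and is dropped.
std::vector<std::wstring> splitWords(const std::wstring& text) {
    std::vector<std::wstring> words;
    std::wstring current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (isWordLetter(text[i]) || isInnerApostrophe(text, i)) {
            current += text[i];
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

For the input `L"my name (vivek don't"` this yields the four words my, name, vivek and don't. Whether accented letters are recognized depends on the global locale and the platform's wchar_t encoding, so for text in arbitrary languages a library such as ICU is likely still the more robust route.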

Maybe I am not explaining my need correctly. I have to parse a file and extract the words from it, and the words can be in any language. I have tried splitting the text on spaces, but that does not seem to be sufficient. For example, given the text "my name (vivek ", I want to get three words: my, name and vivek. I need a general solution for this, or at least an idea of how to handle such cases. — Vivek Kumar, Jun 23 '11 at 15:03
Well, the biggest problem is to define what characters can or cannot be in a word. Do you accept "don't" as a word? Then you'll have to add some special case to your predicate. Do you accept an 'x' with a circumflex accent in a word? If so, even with wide characters, you'll have to handle sequences of several code points (so you can't use `find_if`). The same thing holds if you can't guarantee a specific canonical form in your input. — James Kanze, Jun 23 '11 at 16:28

score 1 · Answer 2 · answered Jun 23 '11 at 14:50

The <locale> versions of std::isalpha (and its is* relatives) are templated on the character type and also accept a locale, which allows better localization. I would just iterate over the string or wstring and use the is* function(s) that express the behavior you're interested in (I can't tell from the problem statement which characters you want to allow and disallow).
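A short sketch of that iteration, using the two-argument std::isalpha from <locale>. The function name `nonAlphaPositions` is invented for this example, and treating "alphabetic" as the only allowed class is an assumption, since the question does not say which characters should count:

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Collects the index of every character that is not alphabetic according to loc.
// Assumption: "allowed in a word" means "alphabetic"; adjust the test as needed.
std::vector<std::size_t> nonAlphaPositions(const std::wstring& text,
                                           const std::locale& loc) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Two-argument overload from <locale>: classification comes from loc,
        // not from the global C locale.
        if (!std::isalpha(text[i], loc)) {
            positions.push_back(i);
        }
    }
    return positions;
}
```

Calling `nonAlphaPositions(L"(hello}", std::locale::classic())` reports positions 0 and 6. To classify a character such as "ó" as a letter, you would pass a suitable named locale instead, for example `std::locale("es_ES.UTF-8")`; locale names are platform-specific, and the constructor throws std::runtime_error if that locale is not installed.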

Mark B

For basic purposes, that would be my recommendation, too. For hardcore no-scripts-barred Unicode processing something like ICU would probably be better, but it totally depends on your situation. Locale-aware regexes sound pretty sexy. – Kerrek SB Jun 23 '11 at 22:51

red1ynx · Answer 3 · 2011-06-23T14:57:28.500

0

Use std::wstring and std::iswalpha().

edited Jun 23 '11 at 14:57

answered Jun 23 '11 at 14:50
