
Env: Boost 1.53.0, C++11.

I'm new to C++.

In Boost.Locale boundary analysis, rule types are specified for words (e.g. boundary::word_letter, boundary::word_number) and sentences, but there is no boundary rule type for characters. All I want is something like isUpperCase(), isLowerCase(), isDigit(), isPunctuation().

I tried the Boost string algorithms, which didn't work:

boost::locale::generator gen;
std::locale loc = gen("ru_RU.UTF-8");
std::string context = "ДВ";
std::cout << boost::algorithm::all(context, boost::algorithm::is_upper(loc));

Why are these features so easy to access in Java or Python but so confusing in C++? Is there a consistent way to achieve this?

Tilney
  • What do you mean by "boost string algorithm which didn't work"? Does your program crash? – user1 Dec 29 '14 at 02:56
  • = =! It doesn't work as expected: wrong result. It can only handle ASCII letters. Thanks again~ – Tilney Dec 29 '14 at 02:57
  • Which operating system? What is the code page your source file is saved in? – user1 Dec 29 '14 at 02:58
  • Ubuntu 12.04. Everything is encoded in UTF-8. – Tilney Dec 29 '14 at 02:59
  • Take a look at the program in this question: http://stackoverflow.com/questions/27614666/print-all-stdlocale-names-windows/27615711#27615711. It is very similar to what you are attempting, and it works fine. Just adapt it to your program (changing the locale, of course) and see if it works. – user1 Dec 29 '14 at 03:02
  • Are you 100% sure about the locale "ru_RU.UTF-8"? How did you find the correct locale? You must use the locale -a command to list all supported locales and then choose appropriately. – user1 Dec 29 '14 at 03:07
  • I tried `locale -a` and found out that `ru_RU.UTF-8` was not installed. After `locale-gen ru_RU.UTF-8`, I recompiled and ran it again; still no luck! – Tilney Dec 29 '14 at 03:27
  • If a locale name is not right, a bad_cast exception would be thrown. – Tilney Dec 29 '14 at 03:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/67822/discussion-between-user1-and-tilney). – user1 Dec 29 '14 at 03:31

2 Answers


This works for me under VS 2013.

locale::global(locale("ru-RU")); 
std::string context = "ДВ"; 
std::cout << any_of(context.begin(), context.end(), boost::algorithm::is_upper());

Prints 1

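How you initialize the locale matters: is_upper() called without an argument uses a copy of the global locale, so the global locale has to be set first.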
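Because the result depends on which locale actually gets installed, it can help to check that the name is accepted before classifying anything. The following is a minimal sketch that assumes only the standard library: the `std::locale` constructor throws `std::runtime_error` for a name the system does not provide, so the hypothetical helper `locale_or_classic` (not part of Boost or the standard library) reports the problem and falls back to the classic "C" locale.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption for illustration): returns the named
// locale, or the classic "C" locale if the system does not provide one
// under that name.
std::locale locale_or_classic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error& e) {
        // std::locale's constructor throws std::runtime_error for unknown names.
        std::cerr << "locale '" << name << "' is not available: "
                  << e.what() << '\n';
        return std::locale::classic();
    }
}
```

Calling `std::locale::global(locale_or_classic("ru_RU.UTF-8"));` at the start of main would then turn a missing locale into a visible message rather than an uncaught exception.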

UPDATE:

Here's a solution that works under Ubuntu.

#include <algorithm> // std::any_of
#include <iostream>
#include <locale>
#include <string>

#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/predicate.hpp>
#include <boost/locale.hpp>

using namespace std;

int main()
{
    locale::global(locale("ru_RU"));

    wstring context = L"ДВ";
    wcout << boolalpha << any_of(context.begin(), context.end(), boost::algorithm::is_upper());

    wcout << endl;

    wstring context1 = L"ПРИВЕТ, МИР"; // "HELLO, WORLD" in Russian
    wcout << boolalpha << any_of(context1.begin(), context1.end(), boost::algorithm::is_upper());

    wcout << endl;

    wstring context2 = L"привет мир"; // "hello world" in Russian
    wcout << boolalpha << any_of(context2.begin(), context2.end(), boost::algorithm::is_upper());

    return 0;
}

Prints

true
true
false

This also works with boost::algorithm::all:

wstring context = L"ДВ";
wcout << boolalpha << boost::algorithm::all(context, boost::algorithm::is_upper());
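The `std::string` version in the question fails for a reason that is separate from the locale name: in UTF-8 each Cyrillic letter occupies two bytes, so a byte-wise predicate never sees a whole letter, whereas on Linux each `wchar_t` in a `wstring` holds one code point. If the rest of a pipeline stays in UTF-8, one option is to decode to code points just before classification. Below is a minimal sketch of such a decoder; `decode_utf8` is a hypothetical helper, not a Boost or standard function, and it assumes mostly well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration): decode UTF-8 bytes
// into Unicode code points. Stray lead bytes and truncated sequences become
// U+FFFD; continuation bytes and overlong forms are NOT validated.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; } // stray byte

        if (i + extra >= bytes.size()) { // sequence cut off at end of input
            out.push_back(0xFFFD);
            break;
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            cp = (cp << 6) | (cont & 0x3F); // append the low 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// In UTF-8, "ДВ" is the 4 bytes D0 94 D0 92, which decode to the
// 2 code points U+0414 and U+0412.
```

On a platform where `wchar_t` is 32 bits wide (as on Linux), the decoded code points could be copied into a `wstring` and passed to the wide predicates shown above; a full Unicode library would be the more robust choice for untrusted input.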
user1
  • It finally works for me. In fact, this is part of my named entity tagging project. From sentence segmentation and tokenization to named entity tagging, the whole pipeline uniformly uses `unicode string`, and the basic character unit is a UTF-8 char. According to http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring , `wstring` is preferred under Windows but not under Linux. Anyway, if there is no other option, then it has to be this way. Thanks for everything! – Tilney Dec 29 '14 at 06:03
  • I posted another way to achieve this below. Check it out if you're interested. – Tilney Dec 29 '14 at 09:43

Boost.Locale is based on ICU, and ICU itself provides character-level classification, which seems pretty consistent and readable (more Java-style).

Here is a simple example.

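#include <cstdio>
#include <string>

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <unicode/utypes.h>

int main()
{
    // The source file is assumed to be saved as UTF-8.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("А аБ Д д2 -");
    UErrorCode status = U_ZERO_ERROR; // must start as "no error" before any ICU call
    icu::Locale ru("ru", "RU");
    icu::BreakIterator* bi = icu::BreakIterator::createCharacterInstance(ru, status);
    if (U_FAILURE(status)) {
        std::fprintf(stderr, "createCharacterInstance failed: %s\n", u_errorName(status));
        return 1;
    }
    bi->setText(s);
    // The last boundary equals s.length(); no character starts there, so stop before it.
    for (int32_t p = bi->first(); p != icu::BreakIterator::DONE && p < s.length(); p = bi->next()) {
        UChar32 c = s.char32At(p); // full code point, not just a UTF-16 code unit
        std::string type;
        if (u_isUUppercase(c))
            type = "upper";
        if (u_isULowercase(c))
            type = "lower";
        if (u_isUWhiteSpace(c))
            type = "whitespace";
        if (u_isdigit(c))
            type = "digit";
        if (u_ispunct(c))
            type = "punc";
        std::printf("Boundary at position %d is %s\n", p, type.c_str());
    }
    delete bi;
    return 0;
}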

Tilney