Find substring in string using locale

Question

I need to find if a string contains a substring, but according to the current locale's rules.

So, if I'm searching for the string "aba", with the Spanish locale, "cabalgar", "rábano" and "gabán" would all three contain it.

I know I can compare strings using locale information (collate), but is there a built-in or straightforward way to do the same with find, or do I have to write my own?

I'm fine using std::string (up to TR1) or MFC's CString.

You have to write your own (or get a third party to do it for you). — john, Sep 26 '13 at 07:32
Maybe relevant: http://stackoverflow.com/a/144804/85371 and Boost Locale (http://www.codeproject.com/Questions/595935/Howplustopluscompareplusunicodeplusstringsplusigno) — sehe, Sep 26 '13 at 10:21
Those rules are not there. The standard collation for Spanish locales distinguishes accents. Under those rules, "rábano" does not contain "aba". What you want are *your rules*, so you have to write them yourself. A lazy implementation would start by decomposing the string (normalize to form D) and then removing all non-starter characters. That's too blunt, but works for your examples with Spanish. For other languages you'll need to be more selective on which non-starters to drop. — R. Martinho Fernandes, Sep 26 '13 at 10:21
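
A minimal sketch of the "lazy" approach described in the last comment, assuming C++11 and well-formed UTF-8 input. It does not perform real Unicode normalization: instead of decomposing to form D, it uses a tiny hard-coded table for the lowercase accented vowels used in Spanish and drops code points in the combining diacritical marks range (U+0300 to U+036F). The names decode_utf8, fold_accents and contains_folded are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into code points.
// Assumes well-formed input; no validation is attempted.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }
        else               { cp = c & 0x07; len = 4; }
        // Append the payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: a stand-in for "normalize to form D, then drop
// non-starters". Only a few precomposed lowercase vowels are mapped;
// a real implementation would use a Unicode library such as ICU.
std::u32string fold_accents(const std::u32string& in)
{
    std::u32string out;
    for (char32_t cp : in) {
        if (cp >= 0x0300 && cp <= 0x036F)
            continue;                                   // combining mark: drop it
        switch (cp) {
            case 0x00E1: cp = U'a'; break;              // á
            case 0x00E9: cp = U'e'; break;              // é
            case 0x00ED: cp = U'i'; break;              // í
            case 0x00F3: cp = U'o'; break;              // ó
            case 0x00FA: case 0x00FC: cp = U'u'; break; // ú, ü
            default: break;                             // everything else (including ñ) is kept
        }
        out.push_back(cp);
    }
    return out;
}

// Accent-insensitive substring test built on the two helpers above.
bool contains_folded(const std::string& haystack, const std::string& needle)
{
    return fold_accents(decode_utf8(haystack))
               .find(fold_accents(decode_utf8(needle))) != std::u32string::npos;
}
```

With this sketch, contains_folded finds "aba" in "cabalgar", "rábano" and "gabán", whether the accented letters are precomposed or written with a combining acute accent. It does no case folding, and as the comment notes, other languages would need a more selective choice of which marks to drop.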

score 3 · Answer 1 · answered Sep 28 '14 at 19:12

For reference, here is an implementation using boost locale compiled with ICU backend:

#include <iostream>
#include <boost/locale.hpp>

namespace bl = boost::locale;

std::locale usedLocale;

std::string normalize(const std::string& input)
{
    const bl::collator<char>& collator = std::use_facet<bl::collator<char> >(usedLocale);
    return collator.transform(bl::collator_base::primary, input);
}

bool contain(const std::string& op1, const std::string& op2){
    std::string normOp2 = normalize(op2);

    //Gotcha!! collator.transform() returns an accessible null byte (\0) at
    //the end of the string. That's why we search only up to 'normOp2.length()-1'
    return  normalize(op1).find( normOp2.c_str(), 0, normOp2.length()-1 ) != std::string::npos;
}

int main()
{
    bl::generator generator;
    usedLocale = generator(""); //use default system locale

    std::cout << std::boolalpha
                << contain("cabalgar", "aba") << "\n"
                << contain("rábano", "aba") << "\n"
                << contain("gabán", "aba") << "\n"
                << contain("gabán", "Âbã") << "\n"
                << contain("gabán", "aba.") << "\n";
}

Output:

true
true
true
true
false

score 1 · Answer 2 · answered Sep 26 '13 at 07:35


You could loop over the string indices, and compare a substring with the string you want to find with std::strcoll.
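
A minimal sketch of that loop, under one loud assumption: the window compared at each position has the same byte length as the needle. That is fine for single-byte encodings, but in a multibyte locale two strings that collate as equal may differ in byte length, and a fixed window can split a character. The name contains_coll is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Slide a window of needle.size() bytes over the haystack and compare each
// window to the needle with std::strcoll, which follows the LC_COLLATE
// category of the current C locale.
bool contains_coll(const std::string& haystack, const std::string& needle)
{
    if (needle.size() > haystack.size())
        return false;
    for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i) {
        // Assumption: a match occupies exactly as many bytes as the needle.
        const std::string window = haystack.substr(i, needle.size());
        if (std::strcoll(window.c_str(), needle.c_str()) == 0)
            return true;
    }
    return false;
}
```

Call std::setlocale(LC_COLLATE, ...) before using it; in the default "C" locale it behaves like a plain byte-wise search. Whether "rábano" matches "aba" depends entirely on whether the locale's collation treats "á" and "a" as equal, which standard Spanish collation does not.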


Some programmer dude


sehe · Answer 3 · 2013-09-26T10:29:22.957

I haven't used this before, but std::strxfrm looks to be what you could use:

http://en.cppreference.com/w/cpp/locale/collate/transform

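#include <clocale>
#include <cstring>
#include <iostream>
#include <string>

std::string xfrm(std::string const& input)
{
    // strxfrm returns the key length, excluding the terminating null
    const std::size_t n = std::strxfrm(nullptr, input.c_str(), 0);
    std::string result(n + 1, '\0');
    std::strxfrm(&result[0], input.c_str(), result.size());
    result.resize(n); // drop the terminator so find() does not try to match it

    return result;
}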

int main()
{
    using namespace std;
    setlocale(LC_ALL, "es_ES.UTF-8");

    const string aba    = "aba";
    const string rabano = "rábano";

    cout << "Without xfrm: " << aba << " in " << rabano << " == " << 
        boolalpha << (string::npos != rabano.find(aba)) << "\n";

    cout << "Using xfrm:   " << aba << " in " << rabano << " == " << 
        boolalpha << (string::npos != xfrm(rabano).find(xfrm(aba))) << "\n";
}

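However, as you can see, this doesn't do what you want: the standard collation for Spanish locales distinguishes accents, so "rábano" still does not contain "aba". See the comment at your question.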
