
Program:

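#include <iostream>
#include <string>
using namespace std;

void foo() {
    string sourceStr = "Tag:贾鑫@VoltDB";
    string insertStr = "XinJia";
    int start = 4;   // intended as a character position
    int length = 2;  // intended as a character count

    sourceStr.erase(start, length);
    sourceStr.insert(start, insertStr);
    cout << sourceStr << endl;
}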

I want this program to print "Tag:XinJia@VoltDB", but it seems that std::string's erase and insert do not work on a UTF-8 string.

Is there a Boost library that I can use? How should I solve this problem?


After talking with others, I realized that the standard library has nothing that solves this problem. So I wrote a function to do the job and would like to share it with others who have a similar problem:

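#include <string>

// Replaces `length` UTF-8 characters of the source with insertStr, starting
// at the 1-based character position startPos. Assumes valid UTF-8 input.
std::string overlay_function(const char* sourceStr, size_t sourceLength,
        const std::string& insertStr, size_t startPos, size_t length) {
    size_t i = 0, j = 0;
    // Advance i to the first byte of character number startPos. A byte starts
    // a character unless it is a continuation byte (bit pattern 10xxxxxx).
    while (i < sourceLength) {
        if ((sourceStr[i] & 0xc0) != 0x80) {
            if (++j == startPos) break;
        }
        i++;
    }
    std::string result(sourceStr, i);
    result.append(insertStr);

    // Skip `length` characters, stopping at the first byte of the next one.
    j = 0;
    while (i < sourceLength) {
        if ((sourceStr[i] & 0xc0) != 0x80) {
            if (j == length) break;
            ++j;
        }
        i++;
    }

    // Append whatever follows the replaced span.
    result.append(sourceStr + i, sourceLength - i);
    return result;
}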

With this function, my program becomes:

cout << overlay_function(sourceStr.c_str(), sourceStr.length(), insertStr, 4 + 1, 2) << endl;

Hope it helps.

xin
  • You might want to read this question: http://stackoverflow.com/q/3011082/10077 Some of the answers include library suggestions. – Fred Larson Apr 15 '14 at 19:40
  • "There's no UTF-8-aware iterator provided in the standard library". Thanks, that's what I need to know. Then I have to write my own function to deal with UTF8 string index. – xin Apr 16 '14 at 19:28
  • I did find a blog post that implements such an iterator. You might want to take a look: http://www.nubaria.com/en/blog/?p=371 – Fred Larson Apr 16 '14 at 20:14
  • I wrote a function to do this work, have not checked your iterator blog yet. But it works for me. Thanks. – xin Apr 16 '14 at 22:39

1 Answer


Indices into a C++ string are code unit (byte) indices, not character (or, in your case, ideogram) indices. In UTF-8 a character can consist of more than one code unit, and that is the case here. You need to find the correct code unit index.
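
To make the difference concrete, here is a minimal sketch that compares the byte length of the question's string with its character count. The helper name utf8_length is made up for illustration, and the sketch assumes the input is valid UTF-8; the two ideograms are written as hex escapes so the example does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte that
// is not a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// "Tag:贾鑫@VoltDB", with each ideogram spelled out as its three UTF-8 bytes.
const std::string sample = "Tag:\xE8\xB4\xBE\xE9\x91\xAB@VoltDB";
```

For this string, sample.size() is 17 while utf8_length(sample) is 13, so erase(4, 2) removes only the first two bytes of the first ideogram.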

Tip 1: I'd use .substr and + string concatenation for this.

Tip 2: it seems that you can search for the characters : and @. Note that these code units cannot occur inside a multi-unit UTF-8 character. Check out the methods of string.
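
The two tips could be combined roughly as follows. This is only a sketch under the assumption that the text to replace always sits between the first ':' and the next '@'; the function name replace_between is hypothetical.

```cpp
#include <string>

// Hypothetical helper: replaces everything between the first ':' and the
// next '@' with `replacement`. Returns the input unchanged if either
// delimiter is missing. Searching for ASCII bytes is safe in UTF-8 because
// every byte of a multi-byte character has its high bit set.
std::string replace_between(const std::string& source,
                            const std::string& replacement) {
    const std::string::size_type colon = source.find(':');
    if (colon == std::string::npos) return source;
    const std::string::size_type at = source.find('@', colon + 1);
    if (at == std::string::npos) return source;
    // Keep "Tag:" (including the colon), then the new text, then "@VoltDB".
    return source.substr(0, colon + 1) + replacement + source.substr(at);
}
```

Called as replace_between(sourceStr, insertStr) on the string from the question, this yields "Tag:XinJia@VoltDB" without any character counting. It does not help when the input has no known delimiters.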

Cheers and hth. - Alf
  • Thanks for your suggestion. But I guess neither tip works for me, as I am not sure of the input string's format. – xin Apr 17 '14 at 14:53
  • @xin Note that your condition `!= 0x80` won't work for UTF-8 characters in general, only those of two encoding units (at least I think so, haven't tested!). Instead of assuming things about bit patterns it's simpler and more clear and more robust and more correct to just check if (unsigned) code is `>=0x80`. If so ignore all but one in the run of such values, i.e. use an inner loop to scan forward. – Cheers and hth. - Alf Apr 17 '14 at 15:51