11

I find it hard to believe I'm the first person to run into this problem, but I searched for quite some time and didn't find a solution.

I'd like to use strncpy but have it be UTF-8 aware, so it doesn't partially write a UTF-8 code point into the destination string.

Otherwise you can never be sure that the resulting string is valid UTF-8, even if you know the source is (whenever the source string is longer than the maximum length).

Validating the resulting string can work, but if this is going to be called a lot it would be better to have a strncpy-like function that checks for this as it copies.

glib has g_utf8_strncpy, but this copies a certain number of Unicode characters, whereas I'm looking for a copy function that limits by byte length.

To be clear, by "UTF-8 aware" I mean that it should not exceed the limit of the destination buffer and it must never copy only part of a UTF-8 code point (given valid UTF-8 input, it must never produce invalid UTF-8 output).
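
For concreteness, a minimal sketch of the failure mode (the buffer size here is arbitrary, chosen only for the demonstration):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[3];
    /* "aé" is 3 bytes: 0x61 0xC3 0xA9.  Copying only 2 bytes keeps 'a' but
     * splits the 2-byte sequence for 'é', leaving a dangling lead byte 0xC3
     * (and strncpy also leaves the result unterminated here). */
    strncpy(buf, "a\xc3\xa9", 2);
    buf[2] = '\0'; /* terminate manually so it can be printed */
    printf("second byte: 0x%02X\n", (unsigned char)buf[1]); /* 0xC3: invalid on its own */
    return 0;
}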


Note:

Some replies have pointed out that strncpy nulls all remaining bytes and that it won't ensure zero termination. In retrospect I should have asked for a UTF-8 aware strlcpy; however, at the time I didn't know that function existed.

ideasman42
  • 3
    Normally you use fully UTF-8 aware libraries like ICU http://icu-project.org to solve these problems and, in the end, who guarantees that a char* is an UTF-8 string and not random garbage null terminated? – xanatos Sep 08 '11 at 07:43
  • 7
    So what? `strncpy` doesn't guarantee to result in a zero ended C string as result either. Contrary to wide spread belief, strncpy is not a "string" function, but a buffer handling function. The 2 often forgotten side effects of it give a clue about that (the 2nd side effect of it is the nulling of the buffer in the size given). – Patrick Schlüter Sep 08 '11 at 08:02
  • @Zan Lynx, resizable destination strings are not an option; the entire API/structs etc. rely on fixed-width strings. – ideasman42 Sep 15 '11 at 13:45
  • @tristopia, I don't see your point; with your own strncpy for UTF-8 it can easily be tweaked to behave however you like with regard to NULL termination. – ideasman42 Sep 15 '11 at 13:47
  • The iconv interface for doing this is very easy: just convert from utf8 to utf8 and deliberately shorten outbytesleft (see the sketch after these comments). – teambob Feb 14 '12 at 04:55
  • Being utf-8 aware isn't enough. You need to not truncate in the middle of combining characters too! – Cory Nelson Jan 08 '15 at 04:07
  • Why are you using legacy C functions to work with UTF-8 data? – Jesper Juhl Mar 26 '23 at 03:01
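
As a sketch of the iconv approach mentioned in the comments above (the helper name utf8_truncate_iconv is made up for illustration): converting UTF-8 to UTF-8 with a deliberately capped outbytesleft makes iconv stop at the last complete character that fits.

#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Copy as much of src as fits into dst (dst_size bytes, NUL included),
 * never splitting a UTF-8 character.  Returns the number of bytes written. */
size_t utf8_truncate_iconv(char *dst, const char *src, size_t dst_size)
{
    if (dst_size == 0)
        return 0;

    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return 0;

    char *in = (char *)src;         /* iconv takes a non-const pointer on many platforms */
    size_t in_left = strlen(src);
    char *out = dst;
    size_t out_left = dst_size - 1; /* reserve room for the terminator */

    /* E2BIG just means the output filled up at a character boundary;
     * EILSEQ/EINVAL would mean src was not valid UTF-8. */
    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t)-1 && errno != E2BIG) {
        /* handle invalid input as you see fit */
    }
    iconv_close(cd);

    *out = '\0';
    return (size_t)(out - dst);
}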

6 Answers

9

I've tested this on many sample UTF-8 strings with multi-byte characters. If the source is too long, it does a reverse search (starting at the null terminator), working backward to find the last full UTF-8 character that can fit in the destination buffer. It always ensures the destination is null-terminated.

#include <string.h> // for strlen() and memcpy()

char* utf8cpy(char* dst, const char* src, size_t sizeDest )
{
    if( sizeDest ){
        size_t sizeSrc = strlen(src); // number of bytes not including null
        while( sizeSrc >= sizeDest ){

            const char* lastByte = src + sizeSrc; // Initially, pointing to the null terminator.
            while( lastByte-- > src )
                if((*lastByte & 0xC0) != 0x80) // Found the initial byte of the (potentially) multi-byte character (or found null).
                    break;

            sizeSrc = lastByte - src;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}
Big Al
  • This is (nearly) the best algorithm here. I'm shocked (shocked!) that _nobody else_ made use of UTF-8's self-synchronizing ability, and the 4-byte maximum length of a UTF char, as a basis to simply find the last complete character by a limited search from the end and `memcpy` everything up to the last complete UTF-8 character. I'd advise however that if indeed `sizeSrc >= sizeDst`, then start with `lastByte = src + sizeDst`; It will be much faster, and at most 4 iterations of the loop will be required. – Iwillnotexist Idonotexist Jan 08 '15 at 05:08
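
A sketch of the variant suggested in the comment above, with the backward scan starting just past the last byte that can be stored, so at most three continuation bytes are skipped (the name utf8cpy_fast is made up; like the original, it assumes src is valid UTF-8):

#include <string.h>

char *utf8cpy_fast(char *dst, const char *src, size_t sizeDest)
{
    if (sizeDest == 0)
        return dst;

    size_t sizeSrc = strlen(src);
    if (sizeSrc >= sizeDest) {
        /* Tentatively keep the first sizeDest-1 bytes, then back up past any
         * continuation bytes (10xxxxxx) so no code point is split.  For valid
         * UTF-8 this loop runs at most 3 times. */
        sizeSrc = sizeDest - 1;
        while (sizeSrc > 0 && (src[sizeSrc] & 0xC0) == 0x80)
            sizeSrc--;
    }
    memcpy(dst, src, sizeSrc);
    dst[sizeSrc] = '\0';
    return dst;
}
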
7

I'm not sure what you mean by UTF-8 aware; strncpy copies bytes, not characters, and the size of the buffer is given in bytes as well. If what you mean is that it will only copy complete UTF-8 characters, stopping, for example, if there isn't room for the next character, I'm not aware of such a function, but it shouldn't be too hard to write:

int
utf8Size( char ch )
{
    static int const sizeTable[] =
    {
        //  ...
    };
    return sizeTable[ static_cast<unsigned char>( ch ) ];
}

char*
stru8ncpy( char* dest, char* source, int n )
{
    while ( *source != '\0' && utf8Size( *source ) < n ) {
        n -= utf8Size( *source );
        switch ( utf8Size( *source ) ) {
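        // no break: deliberate fall-through copies every byte of the sequence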
        case 6:
            *dest ++ = *source ++;
        case 5:
            *dest ++ = *source ++;
        case 4:
            *dest ++ = *source ++;
        case 3:
            *dest ++ = *source ++;
        case 2:
            *dest ++ = *source ++;
        case 1:
            *dest ++ = *source ++;
            break;
        default:
            throw IllegalUTF8();
        }
    }
    *dest = '\0';
    return dest;
}

(The contents of the table in utf8Size are a bit painful to generate, but this is a function you'll be using a lot if you're dealing with UTF-8, and you only have to do it once.)
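
As comments below also note, one way to avoid typing the table by hand is to generate it from the number of leading 1 bits in each byte. A small throwaway generator, as a sketch in plain C (it maps continuation bytes and other invalid lead bytes to 1, matching the glib-style table quoted in another answer; a stricter table could flag them as errors instead):

#include <stdio.h>

int main(void)
{
    for (int b = 0; b < 256; ++b) {
        int len;
        if      (b < 0x80) len = 1; /* 0xxxxxxx: ASCII                  */
        else if (b < 0xC0) len = 1; /* 10xxxxxx: continuation byte      */
        else if (b < 0xE0) len = 2; /* 110xxxxx                         */
        else if (b < 0xF0) len = 3; /* 1110xxxx                         */
        else if (b < 0xF8) len = 4; /* 11110xxx                         */
        else if (b < 0xFC) len = 5; /* legacy 5-byte lead bytes         */
        else if (b < 0xFE) len = 6; /* legacy 6-byte lead bytes         */
        else               len = 1; /* 0xFE, 0xFF: never valid in UTF-8 */
        printf("%d,%s", len, (b % 32 == 31) ? "\n" : "");
    }
    return 0;
}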

James Kanze
  • BTW, your function doesn't behave like [`strncpy`](http://blogs.msdn.com/b/oldnewthing/archive/2005/01/07/348437.aspx), more like [`strlcpy`](http://en.wikipedia.org/wiki/Strlcpy). – Maxim Egorushkin Sep 08 '11 at 09:22
  • 1
    @Hans, why 64MB? You only need to check the [first byte](http://en.wikipedia.org/wiki/UTF-8#Design) in order to get the current length. – Eran Sep 08 '11 at 09:24
  • Actually, @Hans, an unsigned char only has 256 possible values. – Jan Sep 08 '11 at 09:25
  • Ah, right. Don't really understand the code, no clue where *ch* gets its value. – Hans Passant Sep 08 '11 at 09:37
  • @iammilind Because the largest legal UTF-8 character is 6 bytes. – James Kanze Sep 08 '11 at 10:26
  • @Maxim Yes. My function does what is needed (a bounds checked `strcpy` which understands UTF-8), not what was asked for:-). – James Kanze Sep 08 '11 at 10:27
  • 2
    The UTF-8 representation of a Unicode character can never be more than 4 bytes long. Earlier proposals specified 5-byte and 6-byte sequences, but modern UTF-8 tops out at 4 bytes. – Stuart Cook Sep 08 '11 at 10:29
  • @Stuart Cook Yes. The six byte representation will, in fact, cover any 32 bit value. But current Unicode tops out at around 21 bits. – James Kanze Sep 08 '11 at 12:13
  • The table is not painful to generate. If `!(x&0x80)` it is a single byte character, otherwise you can count the number of high 1 bits in the first byte before seeing a zero, and that's the number of bytes in this char. This will not catch invalid UTF-8 sequences though. (Such as a continuation byte following something it shouldn't, or a multi-byte representation of a char value under 128.) UTF-8 is actually a very simple encoding, you can learn pretty much all you need to know from looking at Wikipedia. – asveikau Sep 08 '11 at 22:18
  • @asveikau Entering any table of 256 entries by hand is somewhat painful, although since many entries will be the same, a good editor can facilitate the job considerably. As for information about UTF-8, my usual reference is http://www.cl.cam.ac.uk/~mgk25/unicode.html. (The title says "for Unix/Linux", but practically everything in the page is also applicable to other systems.) – James Kanze Sep 09 '11 at 07:20
  • You do realize you can *generate* such tables, right? It's very simple to write a routine that counts the high bits in a byte before encountering a 0. Then it's very easy to use that routine to spit out a table of 256 values. I don't see where the pain is. – asveikau Sep 09 '11 at 17:11
  • @asveikau You can write a program to generate the tables, but with a good editor, it's probably simpler to do it manually (perhaps piping the results through a one-liner to generate comments indicating the range in each line). And "painful" is relative---there are certainly worse things that often have to be done, but it's still less interesting than writing code. – James Kanze Sep 12 '11 at 09:21
3

To reply to my own question, here's the C function I ended up with (not using C++ for this project):

Notes:

- Realize this is not a clone of strncpy for UTF-8; it's more like strlcpy from OpenBSD.
- utf8_skip_data is copied from glib's gutf8.c.
- It doesn't validate the UTF-8, which is what I intended.

I hope this is useful to others and I'm interested in feedback, but please no pedantry about the NULL termination behavior unless it's an actual bug or misleading/incorrect behavior.

Thanks to James Kanze, who provided the basis for this, though his version was incomplete and in C++ (I needed a C version).

static const size_t utf8_skip_data[256] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
};

char *strlcpy_utf8(char *dst, const char *src, size_t maxncpy)
{
    char *dst_r = dst;
    size_t utf8_size;

    if (maxncpy > 0) {
        while (*src != '\0' && (utf8_size = utf8_skip_data[*((unsigned char *)src)]) < maxncpy) {
            maxncpy -= utf8_size;
            switch (utf8_size) {
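                /* no break: deliberate fall-through copies utf8_size bytes */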
                case 6: *dst ++ = *src ++;
                case 5: *dst ++ = *src ++;
                case 4: *dst ++ = *src ++;
                case 3: *dst ++ = *src ++;
                case 2: *dst ++ = *src ++;
                case 1: *dst ++ = *src ++;
            }
        }
        *dst= '\0';
    }
    return dst_r;
}
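
A small usage sketch (assuming it is compiled together with the function above; the buffer size is arbitrary): with an 8-byte buffer, only the first two 3-byte characters of "日本語" fit, and the truncation lands on a character boundary.

#include <stdio.h>

int main(void)
{
    char buf[8];
    /* "日本語" is 9 bytes of UTF-8; only the first 2 characters (6 bytes) fit. */
    strlcpy_utf8(buf, "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e", sizeof(buf));
    printf("%s\n", buf); /* prints 日本 */
    return 0;
}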
ideasman42
  • There is a small problem when `maxncpy` is 0. In this case, `dst` is still dereferenced and assigned with '\0'. Also, the performance of this is probably not very good. If you can guarantee that `dst` and `src` do not overlap, then you can use the C99 [`restrict` keyword](https://secure.wikimedia.org/wikipedia/en/wiki/Restrict). Otherwise, you can count the number of bytes to copy and then call `memmove`. – Daniel Trebbien Sep 17 '11 at 12:00
  • @Daniel Trebbien, thanks for the hint; in the version I have for the project `restrict` is now used, and I updated the function to check maxncpy > 0. – ideasman42 Oct 31 '12 at 05:18
  • 2
    You are assuming unsigned characters. Many implementations default to signed characters. – wildplasser Oct 31 '12 at 10:20
  • @wildplasser, good point, I build with gcc's -funsigned-char, will update the example. – ideasman42 Nov 02 '12 at 14:45
  • Second point: your `*dst= '\0';` will write beyond the buffer if strlen(src) happens to be >= maxncpy. (the strncpy() behaviour would be to leave the dst string unterminated ...) – wildplasser Nov 03 '12 at 09:50
  • @wildplasser, I ran some tests and wasn't able to cause a buffer overrun - with 1 and 3 byte chars and a destination string that won't fit the source. Maybe I'm missing something - but I don't think it can happen. – ideasman42 Nov 08 '12 at 06:13
  • 1
    For embedded implementations it is very important to minimize data and code size, so first, `utf8_skip_data` should be uint8_t, and second, only the last 32 bytes (the last line) of the table need to be present; the rest of the table is covered by two conditions: `if (char_value < 0xc0) { char_len = 1; } else if (char_value < 0xe0) { char_len = 2; } else { char_len = utf8_skip_data[char_value-0xe0]; }` – imbearr Dec 23 '19 at 08:29
2

strncpy() is a terrible function:

  1. If there is insufficient space, the resulting string will not be nul terminated.
  2. If there is enough space, the remainder is filled with NULs. This can be painful if the target string is very big.

Even if the characters stay in the ASCII range (0x7f and below), the resulting string will not be what you want. In the UTF-8 case it might not be nul-terminated and may end in an invalid UTF-8 sequence.

Best advice is to avoid strncpy().

EDIT, regarding point 1:

#include <stdio.h>
#include <string.h>

int main (void)
{
    char buff [4];

    strncpy (buff, "hello world!\n", sizeof buff );
    printf("%s\n", buff );

    return 0;
}

Agreed, the buffer will not be overrun. But the result is still not what you want. strncpy() solves only part of the problem; it is misleading and best avoided.

UPDATE (2012-10-31): Since this is a nasty problem, I decided to hack my own version, mimicking the ugly strncpy() behavior. The return value is the number of bytes copied, though.

#include <stdio.h>
#include <string.h>

size_t utf8ncpy(char *dst, char *src, size_t todo);
static int cnt_utf8(unsigned ch, size_t len);

static int cnt_utf8(unsigned ch, size_t len)
{
    if (!len) return 0;

    if      ((ch & 0x80) == 0x00) return 1;
    else if ((ch & 0xe0) == 0xc0) return 2;
    else if ((ch & 0xf0) == 0xe0) return 3;
    else if ((ch & 0xf8) == 0xf0) return 4;
    else if ((ch & 0xfc) == 0xf8) return 5;
    else if ((ch & 0xfe) == 0xfc) return 6;
    else return -1; /* Default (Not in the spec) */
}

size_t utf8ncpy(char *dst, char *src, size_t todo)
{
    size_t done, idx, chunk, srclen;

    srclen = strlen(src);
    for (done = idx = 0; idx < srclen; idx += chunk) {
        int ret = 0;            /* 0: treat "no room left" as end of copy */
        for (chunk = 0; done + chunk < todo; chunk++) {
            ret = cnt_utf8(src[idx + chunk], srclen - (idx + chunk));
            if (ret == 1) continue;  /* Normal character: collect it into chunk */
            if (ret < 0) continue;   /* Bad stuff: treat as normal char */
            if (ret == 0) break;     /* EOF */
            if (!chunk) chunk = ret; /* a UTF-8 multi-byte character */
            else ret = 1;            /* we already collected a number (chunk) of normal characters */
            break;
        }
        if (ret > 1 && done + chunk > todo) break;
        if (done + chunk > todo) chunk = todo - done;
        if (!chunk) break;
        memcpy(dst + done, src + idx, chunk);
        done += chunk;
        if (ret < 1) break;
    }
    /* This is part of the dreaded strncpy() behavior:
    ** pad the destination string with NULs
    ** up to its intended size
    */
    if (done < todo) memset(dst + done, 0, todo - done);
    return done;
}

int main(void)
{
    char *string = "Hell\xc3\xb6 \xf1\x82\x82\x82, world\xc2\xa1!";
    char buffer[30];
    unsigned result, len;

    for (len = sizeof buffer - 1; len < sizeof buffer; len -= 3) {
        result = utf8ncpy(buffer, string, len);
        /* remove the following line to get the REAL strncpy() behaviour */
        buffer[result] = 0;
        printf("Chop @%u\n", len);
        printf("Org:[%s]\n", string);
        printf("Res:%u\n", result);
        printf("New:[%s]\n", buffer);
    }

    return 0;
}
wildplasser
  • Note that _if_ the result is a proper C string (i.e. nul-terminated), then it's also a proper UTF-8 string (i.e. no partial characters). And if it's not a proper C string, you should bail out to the error handler anyway. `strncpy` just ensures that you can safely get to that error handler. – MSalters Sep 08 '11 at 10:44
  • UTF-8 is backward-compatible with ASCII; i.e. all ASCII strings are valid UTF-8 strings. – Daniel Trebbien Sep 08 '11 at 22:17
  • But truncated utf8 strings are not valid. And unterminated strings are wrong in both cases. strncpy is worse than the problems it tries to solve. BTW: even if you "solve" the problem and produce a valid and terminated utf8 (or plain ascii) string, it is still truncated. What is the semantic value of the first xxx characters of a string? The program does not crash, but do you really want its results? – wildplasser Sep 08 '11 at 22:28
  • I think your answer misses the point - there are many discussions about strcpy / strncpy / strcpy_s - and how best to deal with NULL termination. My question is that there don't seem to be any functions in common use which copy UTF-8, limit the buffer size, and ensure the resulting string is also valid UTF-8. – ideasman42 Sep 15 '11 at 07:41
  • @MSalters: If you don't have a proper C string, then don't call a function that expects a proper C string. You would not call `fopen` with an http URL, so why would you call a `str*` function with something that's not a string? – Secure Oct 31 '12 at 05:46
  • @MSalters, why do you assume you would want to treat the buffer being too small as an error? We are talking about very low level functions here - strncpy/strlcpy etc. - error handling belongs at a much higher level. Also, you are incorrect in saying a NULL terminated C string is a proper UTF-8 string. – ideasman42 Nov 02 '12 at 07:56
  • @ideasman42: Missing my point. **If** the input to `strncpy` is a proper UTF-8 string, **and if** the result is a proper C string, then the whole string was copied, no UTF-8 sequence was cut short, and therefore the result is equally valid UTF-8 – MSalters Nov 02 '12 at 08:49
1

Here is a C++ solution:

u8string.h:

#ifndef U8STRING_H
#define U8STRING_H 1
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif

/**
 * Copies the first few characters of the UTF-8-encoded string pointed to by
 * \p src into \p dest_buf, as many UTF-8-encoded characters as can be written in
 * <code>dest_buf_len - 1</code> bytes or until the NUL terminator of the string
 * pointed to by \p str is reached.
 *
 * The string of bytes that are written into \p dest_buf is NUL terminated
 * if \p dest_buf_len is greater than 0.
 *
 * \returns \p dest_buf
 */
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len);

#ifdef __cplusplus
}
#endif
#endif

u8slbcpy.cpp:

#include "u8string.h"

#include <cstring>
#include <utf8.h>

char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len)
{
    if (dest_buf_len <= 0) {
        return dest_buf;
    } else if (dest_buf_len == 1) {
        dest_buf[0] = '\0';
        return dest_buf;
    }

    size_t num_bytes_remaining = dest_buf_len - 1;
    utf8::unchecked::iterator<const char *> it(src);
    const char * prev_base = src;
    while (*it++ != '\0') {
        const char *base = it.base();
        ptrdiff_t diff = (base - prev_base);
        if (num_bytes_remaining < diff) {
            break;
        }
        num_bytes_remaining -= diff;
        prev_base = base;
    }

    size_t n = dest_buf_len - 1 - num_bytes_remaining;
    std::memmove(dest_buf, src, n);
    dest_buf[n] = '\0';

    return dest_buf;
}

The function u8slbcpy() has a C interface, but it is implemented in C++. My implementation uses the header-only UTF8-CPP library.

I think that this is pretty much what you are looking for, but note that there is still a problem: one or more combining characters might not be copied if they apply to the nth character (itself not a combining character) and the destination buffer is just large enough to store the UTF-8 encoding of characters 1 through n, but not the combining characters of character n. In that case the bytes representing characters 1 through n are written, but none of the combining characters of n are; in effect, the nth character is only partially written.
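
A tiny sketch of that caveat (the buffer size is chosen to trigger it; it assumes u8slbcpy() above is linked in): with room for only one content byte, the base letter 'e' is copied but its combining acute accent (U+0301, two bytes in UTF-8) is not.

#include <stdio.h>
#include "u8string.h"

int main(void)
{
    char buf[2];
    /* "e" followed by U+0301 COMBINING ACUTE ACCENT: 0x65 0xCC 0x81 (3 bytes). */
    u8slbcpy(buf, "e\xcc\x81", sizeof buf);
    printf("%s\n", buf); /* prints a bare "e", not an accented one */
    return 0;
}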

Daniel Trebbien
0

To comment on the answer above, "strncpy() is a terrible function": I hate to even respond to such blanket statements, at the risk of starting yet another internet programming flame war, but I will anyhow, since statements like this are misleading to those who come here looking for answers.

Okay, maybe C string functions are "old school". Maybe all strings in C/C++ should live in some kind of smart container, and maybe one should use C++ instead of C (when you have a choice) - but these are matters of preference and arguments for other topics.

I came here looking for a UTF-8 strncpy() myself. Not that I couldn't make one (the encoding is, IMHO, simple and elegant), but I wanted to see how others made theirs and perhaps find one optimized in assembly.

To the "gods gift" of the programming world people, put your hubris aside for a moment and look at some facts.

There is nothing wrong with strncpy(), or any of the other similar functions with the same side effects and issues, like _snprintf(), etc.

I say: "strncpy() is not terrible", but rather "terrible programmers use it terribly".

What is "terrible" is not knowing the rules. Furthermore on the whole subject because of security (like buffer overrun) and program stability implications, there wouldn't be a need for example Microsoft to add to it's CRT lib "Safe String Functions" if the rules were just followed.

The main ones:

  1. "sizeof()" returns the length of a static string w/terminator.
  2. "strlen()" returns the length of string w/o terminator.
  3. Most if no all "n" functions just clamp to 'n' with out adding a terminator.
  4. There is implicit ambiguity on what "buffer size" is in functions that require and input buffer size. I.E. The "(char *pszBuffer, int iBufferSize)" types. Safer to assume the worst and pass a size one less then the actual buffer size, and adding a terminator at the end to be sure.
  5. For string inputs, buffers, etc., set and use a reasonable size limit based on expected average and maximum. To hopefully avoid input truncation, and to eliminate buffer overruns period.

This is how I personally handle such things, along with other rules that simply have to be known and practiced.

A handy macro for static string size:

// Size of a string with out terminator
#define SIZESTR(x) (sizeof(x) - 1)

When declaring local/stack string buffers:

A) The size is limited to, for example, 1023 plus 1 for the terminator, allowing strings up to 1023 characters in length.

B) I initialize the string to zero length, and also terminate at the very end to cover a possible 'n' truncation.

char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;

Alternately one could just do: char szBuffer[1024] = {0}; of course, but then there is some performance implication from the compiler-generated memset()-like call that zeroes the whole buffer. It makes things cleaner for debugging though, and I prefer this style for static (vs. local/stack) string buffers.

Now a "strncpy()" following the rules:

char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0; 
strncpy(szBuffer, pszSomeInput, SIZESTR(szBuffer));

There are other "rules" and issues of course, but these are the main ones that come to mind. You just got to know how the lib functions work and to use safe practices like this.

Finally, in my project I use ICU anyway, so I decided to go with it and use the macros in "utf8.h" to make my own "strncpy()".
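
A sketch of what that can look like with the U8_* macros from ICU's utf8.h (the function name is made up, and this is not Sirmabus's actual code): copy at most dstSize - 1 bytes, then back the cut point up to a code point boundary so nothing is split.

#include <string.h>
#include <unicode/utf8.h> /* ICU: U8_SET_CP_START and friends */

char *icu_utf8_lcpy(char *dst, const char *src, int32_t dstSize)
{
    if (dstSize <= 0)
        return dst;

    const uint8_t *s = (const uint8_t *)src;
    int32_t len = (int32_t)strlen(src);
    if (len >= dstSize) {
        int32_t i = dstSize - 1;  /* index of the first byte that cannot be stored */
        U8_SET_CP_START(s, 0, i); /* move back to the start of that code point */
        len = i;                  /* drop the code point that would be split */
    }
    memcpy(dst, src, (size_t)len);
    dst[len] = '\0';
    return dst;
}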

Sirmabus
  • 1
    *"To hopefully avoid input truncation"* Programming on hope is nothing I'd consider "safe practices". I prefer to always get a signal when a string is truncated, because most of the times a truncation should be treated as an error, instead of silently ignoring it. A function that doesn't give me such a signal is out of the question, regardless if it is terrible in itself, has a terrible name for what it does or is a terrible choice for the intented purpose. – Secure Oct 31 '12 at 05:54
  • 1
    Please note that you add an immense amount of brittle logic, just to compensate for strncpy()'s shortcomings. IMO it would have been easier to create your own function (which would do *exactly* what you want) instead of trying to make strncpy() jump through hoops. – wildplasser Nov 04 '12 at 13:36