UTF-16 safe substring in C# .NET

Question

I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string off in the middle of a Unicode character.

For example, see the following code:

var str = "Hello😃world!";
var substr = str.Substring(0, 6);

Here substr is an invalid string since the smiley character is cut in half.

Instead I want a function that does as follows:

var str = "Hello😃world!";
var substr = str.UnicodeSafeSubstring(0, 6);

where substr contains "Hello😃"

For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange

NSString* str = @"Hello😃world!";
NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
NSString* substr = [str substringWithRange:range];

What is the equivalent code in C#?

@Eser UTF-16 characters can be 2 or even 3 chars. So yes you can cut them in half. — Kostub Deshmukh, Aug 11 '15 at 07:07
@Eser read https://msdn.microsoft.com/en-us/library/system.string(v=vs.110).aspx#Characters A Char is a codepoint, a unicode character can contain more than 1 Char. For e.g. 😃 is 0xD83D 0xDE03 which is 2 16-bit chars. — Kostub Deshmukh, Aug 11 '15 at 07:12
I don't understand what the substr function should do... in `"Hello😃"`, what is the `😃`? And how should it work with [combining characters](https://en.wikipedia.org/wiki/Combining_character)? (So, for example, you could have `a + ̀`; if you split it, you get the `a` without the diacritical mark...) — xanatos, Aug 11 '15 at 07:21
@KostubDeshmukh if you know which unicode characters should not be "cut", you can have all of them inside a list or an array; Then with String.IndexOf() method to get its position and finally use Substring() to get what you want. See these links: [msdn](https://msdn.microsoft.com/en-us/library/system.string(v=vs.110).aspx#Characters) and [SO](http://stackoverflow.com/questions/4459571/how-to-recognize-if-a-string-contains-unicode-chars) and [SO](http://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c) — raidensan, Aug 11 '15 at 07:21
@Eser if you do `"😃".Length` you will get `2`. I've tested it. — M.kazem Akhgary, Aug 11 '15 at 07:41

score 10 · Answer 1 · answered Aug 11 '15 at 07:36

Looks like you're looking to split a string on graphemes, that is on single displayed characters.

In that case, you have a handy method: StringInfo.SubstringByTextElements:

var str = "Hello😃world!";
var substr = new StringInfo(str).SubstringByTextElements(0, 6);
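
To make the difference in units concrete, here is a minimal sketch comparing `Substring` with `SubstringByTextElements`. The class name `TextElementDemo` is made up for illustration, and the smiley is written as the escape `"\uD83D\uDE03"` (U+1F603) so that the two chars it occupies are visible in the source.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    public static void Main()
    {
        // U+1F603 is stored as a surrogate pair, i.e. two chars.
        string str = "Hello\uD83D\uDE03world!";

        // Substring counts chars (UTF-16 code units), so a length of 6
        // ends between the two halves of the surrogate pair.
        string cut = str.Substring(0, 6);
        Console.WriteLine(char.IsHighSurrogate(cut[5])); // True: dangling high surrogate

        // SubstringByTextElements counts text elements, so the smiley is
        // the sixth element and is kept whole.
        string safe = new StringInfo(str).SubstringByTextElements(0, 6);
        Console.WriteLine(safe.Length); // 7: five letters plus one surrogate pair
    }
}
```

Note that the `6` here means six text elements, so the result can be longer than six chars.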

Lucas Trzesniewski

The only important thing to remember is that both `0` and `6` are in text element units, not in chars... If `str` consists only of glyphs like 😃 (each of which is 2 chars), `substr` will contain six of them, so `substr.Length == 12` – xanatos Aug 11 '15 at 07:48
As @xanatos mentioned, this doesn't solve the issue when all of the first 6 graphemes are 2 chars each. I still want the length to be 6 plus any additional code points required for the last grapheme, i.e. return only the first three glyphs (6 chars) in the case xanatos mentioned. – Kostub Deshmukh Aug 11 '15 at 08:01
OK, I misunderstood the problem statement then. – Lucas Trzesniewski Aug 11 '15 at 08:10
Your solution works great, but something to keep in mind for anyone stumbling across this later is that the StringInfo API may differ depending on the version of .NET you are using. For me, `SubstringByTextElements` wasn't available in .NET Portable 4.5 – ymbcKxSw Jun 14 '18 at 20:23

score 8 · Accepted Answer · edited May 17 '22 at 12:50

This should return the maximal substring of "complete" graphemes that starts at index startIndex and is at most length chars long. Surrogate pairs that are split at the start or end are removed, initial combining marks are removed, and final characters missing their combining marks are removed.

Note that this probably isn't what you asked for... You seem to want to use graphemes as the unit of measure (or perhaps you want to include the last grapheme even if that pushes the result past the length parameter).

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        // Exclusive end of the requested range, in chars
        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            // Stop before a grapheme that would run past the requested range
            if (startIndex > end)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);

            if (startIndex == end)
            {
                break;
            }
        }

        return sb.ToString();
    }
}

Here is a variant that simply includes "extra" characters at the end of the substring when they are needed to complete the last grapheme:

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

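        // Exclusive end of the requested range, in chars
        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            // Stop once the requested range is covered; a grapheme that
            // started inside the range has already been appended in full
            if (startIndex >= end)
            {
                break;
            }

            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;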

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);
        }

        return sb.ToString();
    }
}

This will return what you asked for: "Hello😃world!".UnicodeSafeSubstring(0, 6) == "Hello😃".

Note: Both of these solutions rely on StringInfo.GetTextElementEnumerator. This method didn't work as expected prior to a fix in .NET 5, so on an earlier version of .NET these solutions will split more complex multi-character emoji.
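
For the common case of truncating from the start of the string, the same idea can be sketched more compactly with `StringInfo.ParseCombiningCharacters`, which returns the start index of every text element. This is only an illustrative sketch: the class `GraphemeTruncate` and the method `TruncateAtTextElement` are hypothetical names, it drops a final grapheme that does not fit (like the first version above), and it is subject to the same .NET version caveat about what counts as a text element.

```csharp
using System;
using System.Globalization;

public static class GraphemeTruncate
{
    // Hypothetical helper (not part of the answer above): returns the longest
    // prefix of str that is at most maxChars chars long and ends on a
    // text-element boundary.
    public static string TruncateAtTextElement(string str, int maxChars)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (maxChars < 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (str.Length <= maxChars) return str;

        // Start index (in chars) of every text element in str, in ascending order.
        int[] starts = StringInfo.ParseCombiningCharacters(str);

        int cut = 0;
        foreach (int start in starts)
        {
            if (start > maxChars) break;
            cut = start; // last boundary that still fits within maxChars
        }

        return str.Substring(0, cut);
    }
}
```

With the string from the question, a limit of 6 yields "Hello" (the smiley would not fit), while a limit of 7 yields "Hello😃".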

This seems reasonable to me, though I was hoping for a builtin function. — Kostub Deshmukh, Aug 11 '15 at 08:05
@KostubDeshmukh Just added a variant that will include extra characters at the end of the substring... — xanatos, Aug 11 '15 at 08:05
Why do you skip graphemes that have their first character being a lower surrogate? Are they just considered malformed/invalid characters? — ymbcKxSw, Jun 19 '18 at 14:31
@ubarar Because they are "incomplete": a surrogate pair is composed by a high surrogate followed by a low surrogate. So if you start with a low surrogate then it is invalid. Combining marks is similar: they are for example diacritics that are put *after* the character (so imagine 'a' + '`')... So a combining mark as the first character is useless (because there is nothing before to combine with) — xanatos, Jun 19 '18 at 15:14
If you run `UnicodeSafeSubstring(0, 13)` on a string holding a single complex emoji, the output is cut short: input.length == 14 and output.length == 12. So it seems this doesn't work for more complex graphemes such as https://emojipedia.org/flag-wales/ — dazbradbury, May 16 '22 at 14:13
It looks like this depends on the .NET version you are running, and the underlying ```GetTextElementEnumerator``` is fixed in .NET 5: https://github.com/dotnet/docs/issues/16702 — dazbradbury, May 16 '22 at 14:51

score 2 · Answer 3 · answered Dec 25 '18 at 07:06

Here is a simple implementation for truncate (startIndex = 0):

string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;
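
The same check can be wrapped in a method that also validates its arguments. The sketch below is an assumption-laden variation rather than the answer's own code: `SurrogateTruncate.Truncate` is a hypothetical name, and it additionally checks that the char before the cut is a high surrogate, so that a stray low surrogate in malformed input does not shorten the result or make the length negative when `maxLength` is 0.

```csharp
using System;

public static class SurrogateTruncate
{
    // Hypothetical helper: truncates to at most maxLength chars without
    // leaving a dangling high surrogate at the end. It only protects
    // surrogate pairs, not combining marks or multi-code-point emoji.
    public static string Truncate(string str, int maxLength)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (str.Length <= maxLength) return str;

        // True when the cut would land between a high and a low surrogate.
        bool splitsPair = maxLength > 0
            && char.IsHighSurrogate(str[maxLength - 1])
            && char.IsLowSurrogate(str[maxLength]);

        return str.Substring(0, splitsPair ? maxLength - 1 : maxLength);
    }
}
```

For the string from the question, `Truncate(str, 6)` backs up by one char and returns "Hello".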

Ilyan

This does not return what the original poster wants - it only returns "Hello" without the smiley. But if you have to truncate text to a certain limit of UTF-16 characters without cutting 4-byte characters in half, this is exactly what this function does. – Alexander Jul 09 '19 at 16:16
