
I want to get a substring of a given length say 150. However, I want to make sure I don't cut off the string in between a unicode character.

e.g. see the following code:

var str = "Hello😃 world!";
var substr = str.Substring(0, 6);

Here substr is an invalid string since the smiley character is cut in half.

Instead I want a function that does as follows:

var str = "Hello😃 world!";
var substr = str.UnicodeSafeSubstring(0, 6);

where substr contains "Hello😃"

For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange:

NSString* str = @"Hello😃 world!";
NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
NSString* substr = [str substringWithRange:range];

What is the equivalent code in C#?

Doomjunky
Kostub Deshmukh
  • @Eser UTF-16 characters can be 2 or even 3 chars. So yes you can cut them in half. – Kostub Deshmukh Aug 11 '15 at 07:07
  • @Eser read https://msdn.microsoft.com/en-us/library/system.string(v=vs.110).aspx#Characters A Char is a codepoint, a unicode character can contain more than 1 Char. For e.g. 😃 is 0xD83D 0xDE03 which is 2 16-bit chars. – Kostub Deshmukh Aug 11 '15 at 07:12
  • I don't understand what the substr function should do... in `"Hello😃"`, what is the `😃`? And how should it work with [combining characters](https://en.wikipedia.org/wiki/Combining_character)? (So, for example, you could have `a + ̀`; if you split it, you get the `a` without the diacritical mark...) – xanatos Aug 11 '15 at 07:21
  • @KostubDeshmukh if you know which unicode characters should not be "cut", you can have all of them inside a list or an array; Then with String.IndexOf() method to get its position and finally use Substring() to get what you want. See these links: [msdn](https://msdn.microsoft.com/en-us/library/system.string(v=vs.110).aspx#Characters) and [SO](http://stackoverflow.com/questions/4459571/how-to-recognize-if-a-string-contains-unicode-chars) and [SO](http://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c) – raidensan Aug 11 '15 at 07:21
  • @Eser if you do `"😃".Length` you will get `2`. I've tested it. – M.kazem Akhgary Aug 11 '15 at 07:41
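
The point made in the comments above can be checked directly. The following is a minimal sketch, not part of the original thread: the class and method names (`SurrogateDemo`, `CodeUnitsHex`, `EndsWithLoneHighSurrogate`) are hypothetical, and it assumes the smiley in the question is U+1F603, as the `0xD83D 0xDE03` comment suggests.

```csharp
using System;
using System.Linq;

public static class SurrogateDemo
{
    // Hex dump of the UTF-16 code units (C# chars) that make up s.
    public static string CodeUnitsHex(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("X4")));
    }

    // True when s ends with a high surrogate that has lost its low surrogate,
    // which is what a cut through the middle of a surrogate pair leaves behind.
    public static bool EndsWithLoneHighSurrogate(string s)
    {
        return s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]);
    }
}
```

With `str = "Hello\uD83D\uDE03 world!"`, `CodeUnitsHex("\uD83D\uDE03")` gives `D83D DE03`, and `EndsWithLoneHighSurrogate(str.Substring(0, 6))` is true, while the same check on `str.Substring(0, 7)` is false.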

3 Answers


Looks like you're looking to split a string on graphemes, that is on single displayed characters.

In that case, you have a handy method: StringInfo.SubstringByTextElements:

var str = "Hello😃 world!";
var substr = new StringInfo(str).SubstringByTextElements(0, 6);
Lucas Trzesniewski
  • The only important thing to remember is that both `0` and `6` are in text element units, and not in characters... If `str` consists only of glyphs such as 😃 (where each glyph is 2 chars), `substr` will contain six of them, so `substr.Length == 12` – xanatos Aug 11 '15 at 07:48
  • As @xanatos mentioned, this doesn't solve the issue when all the first 6 graphemes are 2 chars. I still want the length to be 6 + additional code points required for the last grapheme, i.e. return "😃😃😃" in the case xanatos mentioned. – Kostub Deshmukh Aug 11 '15 at 08:01
  • OK, I misunderstood the problem statement then. – Lucas Trzesniewski Aug 11 '15 at 08:10
  • Your solution works great, but something to keep in mind for someone stumbling across this later is that the StringInfo API may differ depending on the version of .NET you are using. For me, `SubstringByTextElements` wasn't available in .NET Portable 4.5 – ymbcKxSw Jun 14 '18 at 20:23
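
To make the unit mismatch discussed in these comments concrete, here is a small sketch that compares the two ways of measuring a string. The wrapper names (`TextElementDemo`, `CodeUnits`, `TextElements`, `FirstTextElements`) are hypothetical; the underlying calls are `string.Length`, `StringInfo.LengthInTextElements` and `StringInfo.SubstringByTextElements`.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Length in UTF-16 code units, which is what string.Substring counts.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Length in text elements, which is what SubstringByTextElements counts.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // The first `count` text elements of s; throws if s has fewer than that.
    public static string FirstTextElements(string s, int count)
    {
        return new StringInfo(s).SubstringByTextElements(0, count);
    }
}
```

For `"Hello\uD83D\uDE03 world!"` this reports 14 code units but 13 text elements, and `FirstTextElements(..., 6)` returns `"Hello😃"`, which is 7 chars long. For a string of eight `\uD83D\uDE03` smileys, the first six text elements are 12 chars long, which is the case raised in the comments.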

This should return the maximal substring of "complete" graphemes that starts at index startIndex and is at most length chars long. Surrogate pairs split at the start or end of the range are removed, initial combining marks are removed, and final characters missing their combining marks are removed.

Note that probably it isn't what you asked... You seem to want to use graphemes as the unit of measure (or perhaps you want to include the last grapheme even if its length will go over the length parameter)

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        // Exclusive end of the requested range, in chars
        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            string grapheme = enumerator.GetTextElement();

            // startIndex now points just past the current grapheme
            startIndex += grapheme.Length;

            // Stop before a grapheme that would extend past the requested range
            if (startIndex > end)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);

            if (startIndex == end)
            {
                break;
            }
        }

        return sb.ToString();
    }
}

Variant that will simply include "extra" characters at the end of the substring, if necessary to make whole a grapheme:

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        // Exclusive end of the requested range, in chars
        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            // Stop once the requested range is covered; a grapheme that started
            // inside the range has already been appended whole
            if (startIndex >= end)
            {
                break;
            }

            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);
        }

        return sb.ToString();
    }
}

This returns what you asked for: "Hello😃 world!".UnicodeSafeSubstring(0, 6) == "Hello😃".

Note: both of these solutions rely on StringInfo.GetTextElementEnumerator. This method didn't work as expected prior to a fix in .NET 5, so on an earlier version of .NET they will split more complex multi-character emojis.

dazbradbury
xanatos
  • This seems reasonable to me, though I was hoping for a builtin function. – Kostub Deshmukh Aug 11 '15 at 08:05
  • @KostubDeshmukh Just added a variant that will include extra characters at the end of the substring... – xanatos Aug 11 '15 at 08:05
  • Why do you skip graphemes that have their first character being a lower surrogate? Are they just considered malformed/invalid characters? – ymbcKxSw Jun 19 '18 at 14:31
  • @ubarar Because they are "incomplete": a surrogate pair is composed of a high surrogate followed by a low surrogate. So if you start with a low surrogate then it is invalid. Combining marks are similar: they are for example diacritics that are put *after* the character (so imagine 'a' + '`')... So a combining mark as the first character is useless (because there is nothing before it to combine with) – xanatos Jun 19 '18 at 15:14
  • If you run `UnicodeSafeSubstring(0,13)` on a string containing only the Wales flag emoji, it outputs a truncated sequence: input.length == 14, and output.length == 12. So it seems this doesn't work for more complex graphemes such as https://emojipedia.org/flag-wales/ – dazbradbury May 16 '22 at 14:13
  • It looks like this depends on the .NET version you are running, and the underlying `GetTextElementEnumerator` is fixed in .NET 5: https://github.com/dotnet/docs/issues/16702 – dazbradbury May 16 '22 at 14:51

Here is a simple implementation for truncate (startIndex = 0):

string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;
Ilyan
  • This does not return what the original poster wants - it only returns "Hello" without the smiley. But if you have to truncate text to a certain limit of UTF-16 characters without cutting 4-byte characters in half, this is exactly what this function does. – Alexander Jul 09 '19 at 16:16
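
As a rough sketch of the truncate approach in this last answer, the expression can be wrapped in a method. The names `TruncateDemo` and `TruncateAtCodePoint` are hypothetical, and the argument checks and the `maxLength > 0` guard are additions that are not in the original one-liner. As the comment above notes, this only avoids cutting a surrogate pair in half; it does not keep combining marks or multi-code-point emoji together.

```csharp
using System;

public static class TruncateDemo
{
    // Truncates str to at most maxLength chars without splitting a surrogate pair.
    public static string TruncateAtCodePoint(string str, int maxLength)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (str.Length <= maxLength) return str;

        // If the cut would land between a high and a low surrogate,
        // back off by one char so the whole pair is dropped.
        bool splitsPair = maxLength > 0
            && char.IsHighSurrogate(str[maxLength - 1])
            && char.IsLowSurrogate(str[maxLength]);

        return str.Substring(0, splitsPair ? maxLength - 1 : maxLength);
    }
}
```

With `str = "Hello\uD83D\uDE03 world!"`, a limit of 6 yields `"Hello"` (the smiley is dropped rather than halved), while a limit of 7 yields `"Hello😃"`.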