7

I'm saving some strings from a third party into my database (postgres). Sometimes these strings are too long and need to be truncated to fit into the column in my table.

Occasionally I truncate the string in the middle of a Unicode character, which gives me a "broken" string that I cannot save into the database. I get the following error: Unable to translate Unicode character \uD83D at index XXX to specified code page.

I've created a minimal example to show you what I mean. Here I have a string that contains a Unicode character ("Small blue diamond" U+1F539). Depending on where I truncate, it gives me a valid string or not.

var myString = @"This is a string before an emoji:🔹This is after the emoji.";

var brokenString = myString.Substring(0, 34);
// Gives: "This is a string before an emoji:☐"

var test3 = myString.Substring(0, 35);
// Gives: "This is a string before an emoji:🔹"

Is there a way for me to truncate the string without accidentally breaking any Unicode chars?

Joel

4 Answers

6

A Unicode character may be represented by several chars; that is the problem you are having with string.Substring.

You can convert your string to a StringInfo object and then use the SubstringByTextElements() method to get a substring based on the text element (Unicode character) count rather than the char count.

See a C# demo:

Console.WriteLine("🔹".Length); // => 2
Console.WriteLine(new StringInfo("🔹").LengthInTextElements); // => 1

var myString = @"This is a string before an emoji:🔹This is after the emoji.";
var teMyString = new StringInfo(myString);
Console.WriteLine(teMyString.SubstringByTextElements(0, 33));
// => "This is a string before an emoji:"
Console.WriteLine(teMyString.SubstringByTextElements(0, 34));
// => "This is a string before an emoji:🔹"
Console.WriteLine(teMyString.SubstringByTextElements(0, 35));
// => "This is a string before an emoji:🔹T"
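
One thing the demo does not cover: SubstringByTextElements() throws an ArgumentOutOfRangeException when asked for more text elements than the string contains, so a truncation helper needs a length check first. A minimal sketch, assuming a hypothetical helper named TruncateTextElements (the class and method names are illustrative and do not come from the answer):

```csharp
using System;
using System.Globalization;

public static class TextElementTruncation
{
    // Hypothetical helper: keeps at most maxTextElements text elements
    // from the start of the string.
    public static string TruncateTextElements(string value, int maxTextElements)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (maxTextElements < 0) throw new ArgumentOutOfRangeException(nameof(maxTextElements));
        if (maxTextElements == 0) return string.Empty;

        var info = new StringInfo(value);

        // Strings that already fit are returned unchanged; this also avoids the
        // exception SubstringByTextElements throws for an out-of-range length.
        if (info.LengthInTextElements <= maxTextElements) return value;

        return info.SubstringByTextElements(0, maxTextElements);
    }
}
```

With the string from the question, TruncateTextElements(myString, 34) keeps the whole emoji, and a string shorter than the limit comes back unchanged.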
Wiktor Stribiżew
  • Alright! Thank you. I actually found this: https://stackoverflow.com/a/31936096/492067. How does that compare to your solution? The same? – Joel Sep 29 '17 at 08:39
  • @Joel I have studied the [accepted answer](https://stackoverflow.com/a/31936096/492067), and [compared with the current task](https://ideone.com/OcSafo). That substring method is tailored for that specific problem, see Xanatos's explanation: *So initial/final "splitted" surrogate pairs will be removed, initial combining marks will be removed, final characters missing their combining marks will be removed.* – Wiktor Stribiżew Sep 29 '17 at 08:57
  • I was just about to write the same :). I have gone with the accepted answer instead. – Joel Sep 29 '17 at 09:08
  • weird that if I do `var newStr = new StringInfo(text).SubstringByTextElements(0, maxChars);` then `newStr.Length` is not equal `maxChars`. What am I missing? – Toolkit Jan 06 '21 at 13:36
  • @Toolkit You are counting the length of a `string`, in order to get the count of chars in `newStr`, you need to create an instance of `StringInfo` again and then use the [`LengthInTextElements` property](https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.lengthintextelements?view=net-5.0), see [this C# demo](https://ideone.com/3sTDUX). – Wiktor Stribiżew Jan 06 '21 at 13:52
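
The difference raised in the comments above can be checked directly. A small sketch, assuming a hypothetical Measure helper (not part of any answer here), that reports both counts for a string:

```csharp
using System.Globalization;

public static class LengthComparison
{
    // Hypothetical helper: returns the UTF-16 char count (string.Length)
    // alongside the text element count for the same string.
    public static (int Chars, int TextElements) Measure(string value)
    {
        return (value.Length, new StringInfo(value).LengthInTextElements);
    }
}
```

For "ab🔹" this returns 4 chars but 3 text elements, because the emoji occupies two chars (a surrogate pair) and counts as a single text element.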
1

I ended up using a modification of xanatos's answer. The difference is that this version drops the last grapheme if appending it would make the result longer than length.

    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(startIndex));
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var stringBuilder = new StringBuilder(length);

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            var grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            if (startIndex > str.Length)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (stringBuilder.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                var cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            // Do not append the grapheme if the resulting string would be longer than the required length
            if (stringBuilder.Length + grapheme.Length <= length)
            {
                stringBuilder.Append(grapheme);
            }

            if (stringBuilder.Length >= length)
            {
                break;
            }
        }

        return stringBuilder.ToString();
    }
Joel
1

Here is an example that truncates from the start of the string (startIndex = 0):

string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;
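
The same check can be wrapped in a method with argument validation. This is a sketch under assumptions: the name TruncateUtf16 is hypothetical, and the method only avoids splitting a surrogate pair; it does not keep combining marks together with their base character.

```csharp
using System;

public static class SurrogateTruncation
{
    // Hypothetical helper: truncates to at most maxLength chars without
    // leaving a lone high surrogate at the end of the result.
    public static string TruncateUtf16(string str, int maxLength)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (str.Length <= maxLength) return str;

        var cut = maxLength;

        // If the cut would fall between a high and a low surrogate,
        // step back one char so the whole pair is dropped.
        if (cut > 0 && char.IsHighSurrogate(str[cut - 1]) && char.IsLowSurrogate(str[cut]))
        {
            cut--;
        }

        return str.Substring(0, cut);
    }
}
```

For the string in the question, a limit of 34 yields the 33 chars before the emoji, while a limit of 35 keeps the emoji intact.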
Ilyan
0

It is better to truncate by the number of bytes rather than by string length:

    public static string TruncateByBytes(this string text, int maxBytes)
    {
        if (string.IsNullOrEmpty(text) || Encoding.UTF8.GetByteCount(text) <= maxBytes)
        {
            return text;
        }
        var enumerator = StringInfo.GetTextElementEnumerator(text);
        var builder = new StringBuilder();
        var byteCount = 0;
        // Append whole text elements until the next one would exceed the byte budget.
        while (enumerator.MoveNext())
        {
            var element = enumerator.GetTextElement();
            byteCount += Encoding.UTF8.GetByteCount(element);
            if (byteCount > maxBytes)
            {
                break;
            }
            builder.Append(element);
        }
        return builder.ToString();
    }
Toolkit