
I'm using an API that requires the input string to be valid UTF-8 with a maximum length of 4096 bytes.

I had the following function to trim the extra characters:

private static string GetTelegramMessage(string message)
{
    const int telegramMessageMaxLength = 4096; // https://core.telegram.org/method/messages.sendMessage#return-errors
    const string tooLongMessageSuffix = "...";

    if (message == null || message.Length <= telegramMessageMaxLength)
    {
        return message;
    }

    return message.Remove(telegramMessageMaxLength - tooLongMessageSuffix.Length) + tooLongMessageSuffix;
}

It didn't work well, because characters != bytes, and UTF-16 code units != UTF-8 bytes.
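To see the mismatch concretely, here is a small sketch (in Rust, since the question uses it below anyway) comparing the three different "lengths" of the same character; the particular code points are just illustrative:

```rust
fn main() {
    let rune = "ᚠ"; // U+16A0, inside the Basic Multilingual Plane
    assert_eq!(rune.chars().count(), 1);        // one Unicode scalar value
    assert_eq!(rune.encode_utf16().count(), 1); // one UTF-16 code unit (what C#'s string.Length counts)
    assert_eq!(rune.len(), 3);                  // three UTF-8 bytes

    let clef = "𝄞"; // U+1D11E, outside the BMP
    assert_eq!(clef.chars().count(), 1);        // still one scalar value
    assert_eq!(clef.encode_utf16().count(), 2); // a surrogate pair in UTF-16
    assert_eq!(clef.len(), 4);                  // four UTF-8 bytes
}
```

So a 4096-`char` limit in C# can correspond to anything up to roughly three times as many UTF-8 bytes.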

So basically I need to trim my C# UTF-16 string so that its UTF-8 encoding fits a fixed byte length. I have:

var bytes = Encoding.UTF8.GetBytes(myString);
// now I need to get first N characters with overall bytes size less than 4096 bytes

I can express my need in Rust (working example below):

fn main() {
    let foo = format!("{}{}", "ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ Uppen Sevarne staþe, sel þar him þuhte", (1..5000).map(|_| '1').collect::<String>());
    println!("{}", foo.len());
    let message = get_telegram_message(&foo);
    println!("{}", message);
    println!("{}", message.chars().count()); // 4035
    println!("{}", message.len()); // 4096
}

pub fn get_telegram_message(foo: &str) -> String {
    const PERIOD: &'static str = "...";
    const MAX_LENGTH: usize = 4096;
    let message_length = MAX_LENGTH - PERIOD.len();

    foo.chars()
        .map(|c| (c, c.len_utf8())) // getting the length for every char
        .scan((0, '\0'), |(s, _), (c, size)| {
            *s += size; // running total over all previously seen characters
            Some((*s, c))
        })
        .take_while(|(len, _)| len <= &message_length) // taking while the running total stays within the limit
        .map(|(_, c)| c)
        .chain(PERIOD.chars()) // add trailing ellipsis
        .collect() // building a string
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=471ad0cbe9b0b01b50ec250d17dea233
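As a sanity check on the approach above: Rust won't even let you slice a string at a byte index that falls inside a codepoint, which is exactly the hazard the `len_utf8` running total avoids. A minimal sketch (the 4-byte budget is arbitrary):

```rust
fn main() {
    let s = "ᚠᛇᚻ"; // three runes, 9 bytes of UTF-8 (char boundaries at 0, 3, 6, 9)

    // Naive byte slicing can land inside a codepoint:
    assert!(s.get(..4).is_none()); // byte 4 is not a char boundary

    // Walking char-by-char with len_utf8 always stops on a boundary:
    let mut total = 0;
    let truncated: String = s
        .chars()
        .take_while(|c| {
            total += c.len_utf8();
            total <= 4 // byte budget
        })
        .collect();
    assert_eq!(truncated, "ᚠ"); // only the first rune (3 bytes) fits
}
```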

The problem here is that C# has no `chars()`-style iterator that lets me treat a byte sequence as UTF-8 characters.

I've played with `Encoding.UTF8` a bit, but I didn't find appropriate APIs for this task.


The linked questions are somewhat related, but the first answer there is just very bad, and the second one reimplements a UTF-8 iterator (which is what I called `IEnumerable<long>` below). Since I already know how to implement one, my question is about a built-in function for this task, so neither of the linked answers answers it.
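To spell out why a plain byte-length cut is unsafe (sketched in Rust again): truncating the encoded buffer at an arbitrary index can split a multi-byte sequence, leaving invalid UTF-8 that the API would reject:

```rust
fn main() {
    // "𝄞" (U+1D11E) encodes to four UTF-8 bytes; cutting after two of them
    // leaves a dangling prefix of a multi-byte sequence.
    let bytes = "𝄞".as_bytes();
    assert_eq!(bytes.len(), 4);
    assert!(std::str::from_utf8(&bytes[..2]).is_err()); // invalid UTF-8
    assert!(std::str::from_utf8(bytes).is_ok());        // the full sequence is fine
}
```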

Alex Zhukovskiy
    You are mixing _utf-8_ with `string`. `string` is _utf-16_. If you want to work with _utf-8_ you have to `Encoding.UTF8.GetBytes()` and then work with the `byte[]`. – xanatos Mar 04 '19 at 13:11
  • I understand that, that's what I wrote. I'm saying that I don't see any `Utf8Iterator` that takes a `byte[]` and returns an `IEnumerable` or something. – Alex Zhukovskiy Mar 04 '19 at 13:14
  • .NET strings are UTF16. If you want to see how large a string's UTF8 representation is you'd have to convert it to a `byte[]` array first with `Encoding.UTF8.GetBytes()`. The Rust code you posted probably does the *same* thing in the `c.len_utf8()` method. The rest doesn't really explain what you want to do. – Panagiotis Kanavos Mar 04 '19 at 13:14
  • If you're going to convert it to a `byte[]` and then cut the `byte[]` down to size, of course you need to make sure that you don't cut it in the middle of a codepoint. I think that's the challenge here...? – canton7 Mar 04 '19 at 13:16
  • @AlexZhukovskiy how would an `IEnumerable` help at all? Besides, you already have it. UTF16 is 16 bits long, not 32. `Char` can be cast implicitly to `Int32`. A string itself *is* an `IEnumerable`. That won't help you get the UTF8 length though. – Panagiotis Kanavos Mar 04 '19 at 13:17
  • I think you probably want `Encoder.Convert` – canton7 Mar 04 '19 at 13:17
  • @PanagiotisKanavos he wants to take a byte array of UTF-8 encoded text and iterate through the codepoints in it (note that a `char` is not big enough to hold a codepoint). That way he can make sure that he only picks complete codepoints – canton7 Mar 04 '19 at 13:34
  • Possible duplicate of [Best way to shorten UTF8 string based on byte length](https://stackoverflow.com/questions/1225052/best-way-to-shorten-utf8-string-based-on-byte-length) – Heretic Monkey Mar 04 '19 at 13:35
  • The accepted answer in @HereticMonkey's linked question is broken with surrogate pairs, however it looks like some of the others will work. – canton7 Mar 04 '19 at 13:49

1 Answer


I think `Encoder.Convert` is probably the method you're after.

I interpreted the question as meaning

I have a string, which will be turned into UTF-8 bytes. I want to trim it such that its UTF-8 encoding is a maximum of 4096 bytes, but I want to make sure I don't trim it in the middle of a UTF-8 codepoint.

private static string GetTelegramMessage(string message)
{
    const int telegramMessageMaxLength = 4096; // https://core.telegram.org/method/messages.sendMessage#return-errors
    const string tooLongMessageSuffix = "...";

    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= telegramMessageMaxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[telegramMessageMaxLength - Encoding.UTF8.GetByteCount(tooLongMessageSuffix)];
    char[] messageChars = message.ToCharArray();
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed) + tooLongMessageSuffix;
}
canton7
  • Yes, you read it right. Seems like a solution; I'll be back once I've checked it against my situation. I'd seen this method, but I wasn't sure what happens when a char is split across the buffer boundary. – Alex Zhukovskiy Mar 04 '19 at 13:31
  • @AlexZhukovskiy cool! I made a last-minute change to handle surrogate pairs - make sure you've caught that update. – canton7 Mar 04 '19 at 13:32
  • Sorry, I was busy for the last hour. It passes my tests and produces the same result as my Rust code, so I think it's fine. Thank you, gonna try it in production – Alex Zhukovskiy Mar 04 '19 at 14:45
  • @AlexZhukovskiy glad to hear! – canton7 Mar 04 '19 at 14:45
  • I changed code a bit for the case when method is called often and you don't want to allocate another string just to concatenate the final result. See https://gist.github.com/Pzixel/a8123e0731baed76fa52d6ee5a8d3c2e – Alex Zhukovskiy Mar 06 '19 at 09:40
  • @AlexZhukovskiy nice. I think I'd have used `int finalCharLength = encoding.GetChars(bytes, 0, bytesUsed, messageChars, 0)` (to avoid allocating another char array, which `GetString` does internally), then write the `.` as chars into `messageChars` (which avoids assuming that the UTF-8 encoding of `.` is the same as its ASCII encoding, although true), then `return new string(messageChars, 0, finalCharLength + 3)` – canton7 Mar 06 '19 at 10:05
  • In this case you reallocate the `char` array anyway (because `GetChars` won't return a char array you can write more dots into), which doesn't really differ from `GetString`. – Alex Zhukovskiy Mar 06 '19 at 10:14
  • Yeah, you're right. You can allocate a larger `char[]` array to start, and then use `message.CopyTo` instead of `message.ToCharArray()`. I think it's safe to allocate `char[] messageChars = new char[message.Length + tooLongMessageSuffix.Length - 1]`, since we know we're going to have to remove at least one char. – canton7 Mar 06 '19 at 10:22
  • Well, I think I'll stick with my original modification, since I know that UTF-8 is a superset of ASCII, so it's safe to just place the bytes in the array. Thank you for the help anyway :) – Alex Zhukovskiy Mar 07 '19 at 11:04
  • This way saves you another array allocation though (the one that GetString does internally) - twice the savings! – canton7 Mar 07 '19 at 17:21
  • Look, we have at least 3 allocations: the original string, the byte buffer and the resulting string. Plus one extra allocation with `message.ToCharArray()`. We *could* just scan the string (without copying chars into the buffer) via an iterator, or even reinterpret `byte[]` as `char[]`, but C# doesn't allow that. – Alex Zhukovskiy Mar 07 '19 at 17:29
  • Plus an extra char array inside Encoding.GetString. Sure those allocations probably aren't consequential, but you were just keen to remove one of them and added a comment about it, so why not remove two of them, while at the same time making the method work for any encoding? – canton7 Mar 07 '19 at 17:44
  • I think `Encoding.GetString` reuses the char array it returns. So it just does `new string(charBuffer)` instead of copying all of them once more. – Alex Zhukovskiy Mar 07 '19 at 19:59
  • No `new string(buffer)` does a copy. It has to, otherwise you could change the buffer and thereby mutate the string. – canton7 Mar 07 '19 at 20:39