What's the correct way to count the bytes needed for a UTF8 conversion?

Question

I need to count the size, in bytes, that a substring will be once converted into a UTF8 byte array. This needs to happen without actually doing the conversion of that substring. The string I'm working with is very large, unfortunately, and I've got to be careful not to create another large string (or byte array) in memory.

There's a method on the Encoding.UTF8 object called GetByteCount, but I'm not seeing an overload that works on part of a string without first copying it into a char array. This doesn't work for me:

Encoding.UTF8.GetByteCount(stringToCount.ToCharArray(), startIndex, count);

because stringToCount.ToCharArray() will create a copy of my string.

Here's what I have right now:

public static int CalculateTotalBytesForUTF8Conversion(string stringToCount, int startIndex, int endIndex)
{
  var totalBytes = 0;
  for (int i = startIndex ; i < endIndex; i++)
    totalBytes += Encoding.UTF8.GetByteCount(new char[] { stringToCount[i] });

  return totalBytes;
}

The GetByteCount method doesn't appear to have an overload that takes a single char, so this is the compromise I've settled on.

Is this the right way to determine the byte count of a substring, after conversion to UTF8, without actually doing that conversion? Or is there a better method to do this?

Take a look @ http://stackoverflow.com/questions/8511490/calculating-length-in-utf-8-of-java-string-without-actually-encoding-it (c# has ishighsurrogate on Char) — Alex K., Feb 09 '15 at 16:35

score 1 · Answer 1 · answered Feb 09 '15 at 16:38

There doesn't appear to be a built-in method for doing this, so you could either analyze the characters yourself or do the sort of thing you're doing above. The only thing I would recommend is reusing a char[1] array rather than creating a new array on each iteration. Here's an extension method that fits in well with the built-in methods.

using System.Text;

public static class EncodingExtensions
{
    public static int GetByteCount(this Encoding encoding, string s, int index, int count)
    {
        var output = 0;
        var end = index + count;
        // Reuse a single one-element buffer instead of allocating one per character.
        var charArray = new char[1];
        for (var i = index; i < end; i++)
        {
            charArray[0] = s[i];
            // Use the encoding the method was called on, not a hard-coded UTF8.
            output += encoding.GetByteCount(charArray);
        }
        return output;
    }
}
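
One caveat with counting one char at a time: a character outside the Basic Multilingual Plane is stored as a surrogate pair, and handing each half to the encoder separately will likely count it as two invalid characters (three bytes each with the default replacement behaviour) instead of one four-byte sequence. The "analyze the characters yourself" route avoids that. Below is a minimal sketch; the class and method names are made up for illustration, and it assumes you want the same result as Encoding.UTF8 with its default replacement fallback.

```csharp
using System;

public static class Utf8Length
{
    // Hypothetical helper (not a framework method): counts the UTF-8 bytes
    // needed for s[index .. index + count) without allocating any buffer.
    public static int CountUtf8Bytes(string s, int index, int count)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        if (index < 0 || count < 0 || index > s.Length - count)
            throw new ArgumentOutOfRangeException();

        int total = 0;
        int end = index + count;
        for (int i = index; i < end; i++)
        {
            char c = s[i];
            if (c < 0x80)
            {
                total += 1; // ASCII
            }
            else if (c < 0x800)
            {
                total += 2; // U+0080 .. U+07FF
            }
            else if (char.IsHighSurrogate(c) && i + 1 < end && char.IsLowSurrogate(s[i + 1]))
            {
                total += 4; // valid surrogate pair: one code point above U+FFFF
                i++;        // skip the low surrogate we just consumed
            }
            else
            {
                // Rest of the BMP. A lone surrogate also lands here and is
                // counted as U+FFFD (3 bytes), assuming replacement fallback.
                total += 3;
            }
        }
        return total;
    }
}
```

For example, `Utf8Length.CountUtf8Bytes("a\u00E9\u20AC\uD83D\uDE00", 0, 5)` adds up 1 + 2 + 3 + 4 bytes. Note that a substring boundary that cuts a surrogate pair in half is counted as a lone surrogate.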

Great catch on not reallocating that char[]. That should save me several million instantiations. — Nathan, Feb 09 '15 at 16:47
There certainly *are* built-in methods to do this, but they aren't as simple to invoke as one might like. — Paul Turner, Feb 09 '15 at 17:01

Paul Turner · Answer 2 · 2015-02-10T00:18:31.417

1

So, there is an overload which doesn't require the caller to create an array of characters first: Encoding.GetByteCount Method (Char*, Int32)

The issue is that this isn't a CLS-compliant method, and calling it requires some unusual code:

public static unsafe int CalculateTotalBytesForUTF8Conversion(
    string stringToCount,
    int startIndex,
    int endIndex)
{
    // Fix the string in memory so we can grab a pointer to its location.
    fixed (char* stringStart = stringToCount)
    {
        // Get a pointer to the start of the substring.
        char* substring = stringStart + startIndex;

        return Encoding.UTF8.GetByteCount(substring, endIndex - startIndex);
    }
}

Key things to note here:

The method has to be marked unsafe, since we're working with pointers and direct memory manipulation.
The string is fixed for the duration of the call in order to prevent the runtime from moving it around. This gives us a stable location to point to, but it also stops the garbage collector from relocating that memory while the call runs.
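
If unsafe code is a concern and a newer runtime is available, a span over the string may give the same no-copy behaviour. The sketch below is not part of the original answer: it assumes .NET Core 2.1 or later (or .NET Standard 2.1), where `Encoding.GetByteCount(ReadOnlySpan<char>)` exists, and the wrapper class name is hypothetical. Unlike the pointer version, an out-of-range start or end is rejected rather than reading past the string.

```csharp
using System;
using System.Text;

public static class SafeUtf8Count
{
    // Hypothetical wrapper; assumes a runtime that provides
    // Encoding.GetByteCount(ReadOnlySpan<char>).
    public static int CalculateTotalBytesForUTF8Conversion(
        string stringToCount,
        int startIndex,
        int endIndex)
    {
        // AsSpan creates a view over the existing string and copies nothing.
        // It throws ArgumentOutOfRangeException if the range is outside the string.
        ReadOnlySpan<char> substring =
            stringToCount.AsSpan(startIndex, endIndex - startIndex);

        return Encoding.UTF8.GetByteCount(substring);
    }
}
```

On those runtimes there is also an `Encoding.GetByteCount(string, int, int)` overload that takes a start index and a count directly.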

You should consider doing thorough performance profiling on this method to ensure it gives you a better performance profile than simply copying the string to an array.

A bit of basic profiling (a console application executing the algorithms in sequence on my desktop machine) shows that this approach executes ~35 times faster than looping over the string or converting it to a character-array.

Using pointer: ~86ms
Looping over string: ~2957ms
Converting to char array: ~3156ms

Take these figures with a pinch of salt, and also consider other factors besides just execution speed, such as long-term execution overheads (e.g. in a service process), or memory usage.
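
To reproduce this kind of comparison on your own data, a rough Stopwatch harness along these lines should be enough. It is only a sketch, not the code that produced the figures above: the helper names and the test string are invented, and a single run like this ignores JIT warm-up and garbage collection noise.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ByteCountTiming
{
    // Hypothetical helper: runs one counting strategy once and
    // returns the elapsed milliseconds along with its result.
    public static long TimeMs(Func<int> countBytes, out int result)
    {
        var stopwatch = Stopwatch.StartNew();
        result = countBytes();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    public static void Run()
    {
        // Assumed test input; substitute a string shaped like your real data.
        string text = new string('x', 10_000_000);
        var buffer = new char[1];

        long loopMs = TimeMs(() =>
        {
            int total = 0;
            for (int i = 0; i < text.Length; i++)
            {
                buffer[0] = text[i];
                total += Encoding.UTF8.GetByteCount(buffer);
            }
            return total;
        }, out int loopBytes);

        long arrayMs = TimeMs(
            () => Encoding.UTF8.GetByteCount(text.ToCharArray()),
            out int arrayBytes);

        Console.WriteLine($"Looping over string: {loopMs} ms ({loopBytes} bytes)");
        Console.WriteLine($"Converting to char array: {arrayMs} ms ({arrayBytes} bytes)");
    }
}
```

Running each strategy several times and discarding the first run should give steadier numbers than a single pass.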

edited Feb 10 '15 at 00:18

answered Feb 09 '15 at 16:58

Paul Turner


In the code I'm dealing with, I know without a doubt that I cannot safely copy the string to a byte array without risking an OutOfMemory exception. So I'm less concerned about the performance improvement this would provide than about whether it would cause any problems with a very large string (about 150 MB). I know it's terrible to have a string that big, but I don't have a choice at the moment. – Nathan Feb 09 '15 at 17:05
As a side note, this code throws an error: Cannot assign to 'substring' because it is a 'fixed variable'. So I created "char* startOfSubstring = substring + startIndex;" inside the fixed brackets, and used that for GetByteCount. – Nathan Feb 09 '15 at 17:09
You are right to find that error - I'll correct it. – Paul Turner Feb 09 '15 at 17:47
A 150 MB string is pretty crappy: it will wind up on the Large Object Heap. That does actually make pinning it less of an issue though, since the LOH is compacted much less frequently, so pinning has a much smaller impact on overall performance. – Paul Turner Feb 09 '15 at 17:52
Is there a reason to go with this approach over Michael Gunter's update to the method I had? In general I'd rather keep unsafe code out of the application, but I'm not sure if that's an unfounded issue of my own or not. – Nathan Feb 09 '15 at 20:04
This method will complete roughly 35 times faster on a string of approximately 150 MB than looping over the string (~86ms vs ~2957ms on my machine). Additionally, performing the conversion `ToCharArray()` on the string takes ~3156ms, making it comparable to the looping method in terms of execution time. – Paul Turner Feb 10 '15 at 00:05
