I have an XML document stored in a char array (char[]), and I have the content length of the data in an int variable. I need to deserialize the data with XmlSerializer.

For performance reasons, I need to avoid allocating a string object: the data is usually larger than 85 KB, so the string would be allocated on the Large Object Heap, which is only collected during Gen 2 collections.

Is there any way to pass the char[] to XmlSerializer without converting it to a string? It accepts a Stream or a TextReader but I can't find a way to construct one from a char[].

I am imagining something like this (except C# doesn't have a CharArrayStream or CharArrayReader):

public MyEntity DeserializeXmlDocument(char [] buffer, int contentLength) {
    using (var stream = new CharArrayStream(buffer, contentLength))
    {
        return _xmlSerializer.Deserialize(stream) as MyEntity;
    }
}

Just as some more info: we are profiling existing code and have identified this as a pain point, so this is not a case of "premature optimization" or an "XY problem".

sashoalm
  • You can wrap the `char[]` into a `Stream` easily. [Here](https://stackoverflow.com/a/57100948/5114784) is an earlier answer of mine to a similar issue with strings, which prevents unnecessary copying. With a minimal effort you can change it to use `char[]` instead of `string`. Is it enough or should I post a new answer for it? – György Kőszeg Apr 13 '20 at 09:42
  • The link has nothing to do with XML. – jdweng Apr 13 '20 at 09:46
  • Treating the Xml as char[] isn't going to solve the memory issue nor the speed issue. XmlSerializer is slow. It would be better to use your own parser based on Xml Linq. – jdweng Apr 13 '20 at 09:48
  • @jdweng: As @sashoalm also mentions, `XmlSerializer` can deserialize from a `Stream`. Btw, LINQ to XML consumes far too many resources, as it keeps the whole XML in memory. If that matters, you need to use the low-level `XmlReader`. But the OP already has the raw XML content as a char array. – György Kőszeg Apr 13 '20 at 09:51
  • There is no immediate option to use `char[]` with `XmlReader`, short of implementing your own `TextReader` sub-type. Personally, I'd look at whether the data could have been left in a `byte[]` (not decoded) or in a file, and use `StreamReader` (with `FileStream` or `MemoryStream`). However, based on *lots* of experience in this area, I really don't think that the performance problem has anything to do with that one extra string, so frankly using `StringReader` on a `new string` from your `char[]` will behave virtually identically. Ultimately `XmlSerializer` - and XML in general -... – Marc Gravell Apr 13 '20 at 09:53
  • ...isn't known for efficiency. If that is your goal, frankly you might want to consider alternative serializers (and data formats). – Marc Gravell Apr 13 '20 at 09:54
  • I didn't say to use anonymous type. You can create classes to reduce memory. – jdweng Apr 13 '20 at 09:56
  • Yes. Java has CharArrayReader but C# unfortunately does not. I can also rework my code to use byte[] instead of char[] easily. I will try to rework the code linked by @György to create a CharArrayStream or a ByteArrayStream. – sashoalm Apr 13 '20 at 09:58

2 Answers

It's fairly straightforward to subclass TextReader to read from an array of chars or equivalent. Here's a version that takes a ReadOnlyMemory<char> that could represent a slice of either a string or a char [] character array:

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public sealed class CharMemoryReader : TextReader
{
    private ReadOnlyMemory<char> chars;
    private int position;

    public CharMemoryReader(ReadOnlyMemory<char> chars)
    {
        this.chars = chars;
        this.position = 0;
    }

    void CheckClosed()
    {
        if (position < 0)
            throw new ObjectDisposedException(null, string.Format("{0} is closed.", ToString()));
    }

    public override void Close() => Dispose(true);

    protected override void Dispose(bool disposing)
    {
        chars = ReadOnlyMemory<char>.Empty;
        position = -1;
        base.Dispose(disposing);
    }

    public override int Peek()
    {
        CheckClosed();
        return position >= chars.Length ? -1 : chars.Span[position];
    }

    public override int Read()
    {
        CheckClosed();
        return position >= chars.Length ? -1 : chars.Span[position++];
    }

    public override int Read(char[] buffer, int index, int count)
    {
        CheckClosed();
        if (buffer == null)
            throw new ArgumentNullException(nameof(buffer));
        if (index < 0)
            throw new ArgumentOutOfRangeException(nameof(index));
        if (count < 0)
            throw new ArgumentOutOfRangeException(nameof(count));
        if (buffer.Length - index < count)
            throw new ArgumentException("buffer.Length - index < count");

        return Read(buffer.AsSpan().Slice(index, count));
    }

    public override int Read(Span<char> buffer)
    {
        CheckClosed();

        var nRead = chars.Length - position;
        if (nRead > 0)
        {
            if (nRead > buffer.Length)
                nRead = buffer.Length;
            chars.Span.Slice(position, nRead).CopyTo(buffer);
            position += nRead;
        }
        return nRead;
    }

    public override string ReadToEnd()
    {
        CheckClosed();
        var s = position == 0 ? chars.ToString() : chars.Slice(position, chars.Length - position).ToString();
        position = chars.Length;
        return s;
    }

    public override string ReadLine()
    {
        CheckClosed();
        var span = chars.Span;
        var i = position;
        for( ; i < span.Length; i++)
        {
            var ch = span[i];
            if (ch == '\r' || ch == '\n')
            {
                var result = span.Slice(position, i - position).ToString();
                position = i + 1;
                if (ch == '\r' && position < span.Length && span[position] == '\n')
                    position++;
                return result;
            }
        }
        if (i > position)
        {
            var result = span.Slice(position, i - position).ToString();
            position = i;
            return result;
        }
        return null;
    }

    public override int ReadBlock(char[] buffer, int index, int count) => Read(buffer, index, count);
    public override int ReadBlock(Span<char> buffer) => Read(buffer);

    public override Task<String> ReadLineAsync() => Task.FromResult(ReadLine());
    public override Task<String> ReadToEndAsync() => Task.FromResult(ReadToEnd());
    public override Task<int> ReadBlockAsync(char[] buffer, int index, int count) => Task.FromResult(ReadBlock(buffer, index, count));
    public override Task<int> ReadAsync(char[] buffer, int index, int count) => Task.FromResult(Read(buffer, index, count));
    public override ValueTask<int> ReadBlockAsync(Memory<char> buffer, CancellationToken cancellationToken = default) =>
        cancellationToken.IsCancellationRequested ? new ValueTask<int>(Task.FromCanceled<int>(cancellationToken)) : new ValueTask<int>(ReadBlock(buffer.Span));
    public override ValueTask<int> ReadAsync(Memory<char> buffer, CancellationToken cancellationToken = default) =>
        cancellationToken.IsCancellationRequested ? new ValueTask<int>(Task.FromCanceled<int>(cancellationToken)) : new ValueTask<int>(Read(buffer.Span)); 
}

Then use it with one of the following extension methods:

using System;
using System.Xml.Serialization;

public static partial class XmlSerializationHelper
{
    public static T LoadFromXml<T>(this char [] xml, int contentLength, XmlSerializer serial = null) => 
        new ReadOnlyMemory<char>(xml, 0, contentLength).LoadFromXml<T>(serial);

    public static T LoadFromXml<T>(this ReadOnlyMemory<char> xml, XmlSerializer serial = null)
    {
        serial = serial ?? new XmlSerializer(typeof(T));
        using (var reader = new CharMemoryReader(xml))
            return (T)serial.Deserialize(reader);
    }
}

E.g.

var result = buffer.LoadFromXml<MyEntity>(contentLength, _xmlSerializer);

Notes:

  • A char [] character array has basically the same contents as a UTF-16 encoded memory stream without a BOM, so one could create a custom Stream implementation resembling MemoryStream that represents each char as two bytes, as is done in this answer to How do I generate a stream from a string? by György Kőszeg. It looks a bit tricky to do this entirely correctly however, as getting all the async methods right seems nontrivial.

    Having done so, XmlReader will still need to wrap the custom stream with a StreamReader that "decodes" the stream into a sequence of characters, correctly inferring the encoding in the process (which I have observed may occasionally be done wrongly, e.g. when the encoding stated in the XML declaration does not match the actual encoding).

    I chose to create a custom TextReader rather than a custom Stream to avoid the unnecessary decoding step, and because the async implementation seemed less burdensome.

  • Representing each char as a single byte via truncation (e.g. (byte)str[i]) will corrupt XML containing any non-ASCII characters, because a .NET char is a two-byte UTF-16 code unit.

  • I haven't done any performance tuning on the above implementation.
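
For reference, the two-bytes-per-char idea from the first note might look roughly like the sketch below. It is not the linked answer's code: the class name `Utf16CharArrayStream` is made up for illustration, the stream is forward-only (no seeking, no async overrides), and it assumes a little-endian platform, so the bytes come out as UTF-16LE without a BOM.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical name. Read-only, forward-only view of a char[] as raw UTF-16 bytes.
public sealed class Utf16CharArrayStream : Stream
{
    private readonly char[] chars;
    private readonly int charCount;
    private int bytePosition;

    public Utf16CharArrayStream(char[] chars, int charCount)
    {
        if (chars == null) throw new ArgumentNullException(nameof(chars));
        if (charCount < 0 || charCount > chars.Length) throw new ArgumentOutOfRangeException(nameof(charCount));
        this.chars = chars;
        this.charCount = charCount;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => (long)charCount * sizeof(char);

    public override long Position
    {
        get => bytePosition;
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // AsSpan validates offset and count against the buffer.
        Span<byte> destination = buffer.AsSpan(offset, count);

        // Reinterpret the chars as bytes in place: two bytes per char, no copy, no truncation.
        // AsBytes throws OverflowException if charCount * 2 does not fit in an int.
        ReadOnlySpan<byte> source = MemoryMarshal.AsBytes(chars.AsSpan(0, charCount));

        int n = Math.Min(destination.Length, source.Length - bytePosition);
        source.Slice(bytePosition, n).CopyTo(destination);
        bytePosition += n;
        return n; // 0 signals end of stream
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

Whatever consumes such a stream still has to decode the bytes back into chars, for example via `new StreamReader(stream, Encoding.Unicode)`, which is the round trip the custom TextReader avoids.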

Demo fiddle here.

dbc

I reworked the code linked by @György Kőszeg to a class CharArrayStream. This works so far in my tests:

public class CharArrayStream : Stream
{
    private readonly char[] str;
    private readonly int n;

    public override bool CanRead => true;
    public override bool CanSeek => true;
    public override bool CanWrite => false;
    public override long Length => n;
    public override long Position { get; set; } // TODO: bounds check

    public CharArrayStream(char[] str, int n)
    {
        this.str = str;
        this.n = n;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        switch (origin)
        {
            case SeekOrigin.Begin:
                Position = offset;
                break;
            case SeekOrigin.Current:
                Position += offset;
                break;
            case SeekOrigin.End:
                // Offsets relative to the end are added (they are normally zero or negative).
                Position = Length + offset;
                break;
        }

        return Position;
    }

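    // Caution: this cast keeps only the low byte of each two-byte char, so any non-ASCII character is corrupted.
    private byte this[int i] => (byte)str[i];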

    public override int Read(byte[] buffer, int offset, int count)
    {
        // TODO: bounds check
        var len = Math.Min(count, Length - Position);
        for (int i = 0; i < len; i++)
        {
            buffer[offset++] = this[(int)(Position++)];
        }
        return (int)len;
    }

    public override int ReadByte() => Position >= Length ? -1 : this[(int)Position++];
    public override void Flush() { }
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override string ToString() => throw new NotSupportedException();
}

I can use it in this way:

public MyEntity DeserializeXmlDocument(char [] buffer, int contentLength) {
    using (var stream = new CharArrayStream(buffer, contentLength))
    {
        return _xmlSerializer.Deserialize(stream) as MyEntity;
    }
}

Thanks, @György Kőszeg!

sashoalm
  • This implementation does not work when the XML contains multibyte (non-ASCII) characters. See: https://dotnetfiddle.net/yjlYGF. György Kőszeg's original version represents each `char` as two bytes so seems to handle multibyte characters correctly, see https://dotnetfiddle.net/ppgrGu. – dbc Apr 17 '20 at 21:15
  • I think the Stream class doesn't have a concept of encoding. It just treats data as a binary stream of bytes, not text. – sashoalm Apr 17 '20 at 21:47
  • Yes that's correct. But a [`char`](https://learn.microsoft.com/en-us/dotnet/api/system.char?view=netframework-4.8) in .Net is a two-byte structure which is why György Kőszeg's implementation works as it does. When you do `(byte)str[i]` you are truncating a two-byte value into a single byte. – dbc Apr 17 '20 at 21:49
  • My bad, coming from C++ I was treating char and byte as the same. – sashoalm Apr 17 '20 at 21:52
  • Ah I see. In .NET `Stream` represents a sequence of bytes; that in turn gets wrapped in a `StreamReader` (subclass of `TextReader`) which decodes the bytes into a sequence of Unicode chars, which gets wrapped in `XmlReader` which parses the character sequence into a sequence of XML nodes, which is used by `XmlSerializer` to deserialize the XML. When you pass a `Stream` into `XmlSerializer` all this wrapping happens internally. By creating our own `TextReader` subclass we are injecting the `char` array at a slightly higher, arguably more appropriate level. – dbc Apr 17 '20 at 21:55
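
The layering described in the last comment can be written out by hand. The sketch below is only an illustration: it assumes the XML is still available as UTF-8 bytes (as suggested in the comments on the question), and `MyEntity` and `ExplicitPipeline` are placeholder names. Each `using` line corresponds to one layer that `XmlSerializer` would otherwise create internally when handed a `Stream`.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Placeholder entity type, standing in for the question's MyEntity.
public class MyEntity
{
    public string Name { get; set; }
}

public static class ExplicitPipeline
{
    public static MyEntity Deserialize(byte[] utf8Xml, int byteCount)
    {
        var serializer = new XmlSerializer(typeof(MyEntity));

        // Layer 1: bytes. Wraps the existing array without copying it.
        using (var stream = new MemoryStream(utf8Xml, 0, byteCount, writable: false))
        // Layer 2: chars. Decodes the bytes; the encoding is assumed to be UTF-8 here.
        using (var textReader = new StreamReader(stream, Encoding.UTF8))
        // Layer 3: XML nodes. Parses the character sequence.
        using (var xmlReader = XmlReader.Create(textReader))
        {
            // Layer 4: objects.
            return (MyEntity)serializer.Deserialize(xmlReader);
        }
    }
}
```

A custom `TextReader` over the `char[]` replaces layers 1 and 2, which is why no decoding step is needed in that approach.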