Reading string as a stream without copying

Question

I have some data in a string. I have a function that takes a stream as input. I want to provide my data to my function without having to copy the complete string into a stream. Essentially I'm looking for a stream class that can wrap a string and read from it.

The only suggestions I've seen online suggest the StringReader which is NOT a stream, or creating a memory stream and writing to it, which means copying the data. I could write my own stream object but the tricky part is handling encoding because a stream deals in bytes. Is there a way to do this without writing new stream classes?

I'm implementing pipeline components in BizTalk. BizTalk deals with everything entirely with streams, so you always pass things to BizTalk in a stream. BizTalk will always read from that stream in small chunks, so it doesn't make sense to copy the entire string to a stream (especially if the string is large), if I can read from the stream how BizTalk wants it.

Do you realize that `Stream` can only *copy* data? (e.g. into an array provided to `Read`). — Peter Ritchie, Oct 02 '14 at 19:57
You will have to decode the string at some point to get the actual bytes for that string in the encoding you want. You're going to have to make a copy whether you want to or not. — Jeff Mercado, Oct 02 '14 at 20:00
@Peter Ritchie - Perhaps I inappropriately miss-phrased the question, but typically it happens in small chucks, via the read method. Using memory stream copies the entire string all at once. — Jeremy, Oct 02 '14 at 20:28
If you're looking at ways of reducing copying data that may be sent/received, you might want to look at `ArraySegment` http://msdn.microsoft.com/en-us/library/1hsbd92d(v=vs.110).aspx — Peter Ritchie, Oct 02 '14 at 21:47
If you're in a Pipeline Component, is there a way to not create the string in the first place? — Johns-305, Oct 03 '14 at 19:51

xmedeko · Answer 1 · 2023-04-11T07:18:43.890

Here is a proper StringReaderStream with following drawbacks:

The buffer for Read has to be at least maxBytesPerChar long. It's possible to implement Read for small buffers by keeping internal one char buff = new byte[maxBytesPerChar]. But's not necessary for most usages.
No Seek, it's possible to do seek, but would be very tricky in general. (Some seek cases, like seek to beginning, seek to end, are simple to implement. )

/// <summary>
/// Convert string to byte stream.
/// <para>
/// Slower than <see cref="Encoding.GetBytes()"/>, but saves memory for a large string.
/// </para>
/// </summary>
public class StringReaderStream : Stream
{
    private string input;
    private readonly Encoding encoding;
    private int maxBytesPerChar;
    private int inputLength;
    private int inputPosition;
    private readonly long length;
    private long position;

    public StringReaderStream(string input)
        : this(input, Encoding.UTF8)
    { }

    public StringReaderStream(string input, Encoding encoding)
    {
        this.encoding = encoding ?? throw new ArgumentNullException(nameof(encoding));
        this.input = input;
        inputLength = input == null ? 0 : input.Length;
        if (!string.IsNullOrEmpty(input))
            length = encoding.GetByteCount(input);
            maxBytesPerChar = encoding == Encoding.ASCII ? 1 : encoding.GetMaxByteCount(1);
    }

    public override bool CanRead => true;

    public override bool CanSeek => false;

    public override bool CanWrite => false;

    public override long Length => length;

    public override long Position
    {
        get => position;
        set => throw new NotImplementedException();
    }

    public override void Flush()
    {
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (inputPosition >= inputLength)
            return 0;
        if (count < maxBytesPerChar)
            throw new ArgumentException("count has to be greater or equal to max encoding byte count per char");
        int charCount = Math.Min(inputLength - inputPosition, count / maxBytesPerChar);
        int byteCount = encoding.GetBytes(input, inputPosition, charCount, buffer, offset);
        inputPosition += charCount;
        position += byteCount;
        return byteCount;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotImplementedException();
    }

    public override void SetLength(long value)
    {
        throw new NotImplementedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotImplementedException();
    }
}

dbc · Answer 2 · 2021-05-06T20:06:05.477

While this question was originally tagged c#-4.0, this can be done fairly easily in .NET 5 with the introduction of Encoding.CreateTranscodingStream:

Creates a Stream that serves to transcode data between an inner Encoding and an outer Encoding, similar to Convert(Encoding, Encoding, Byte[]).

The trick is to define an underlying UnicodeStream that directly accesses the bytes of the string then wrap that in the transcoding stream to present streamed content with the required encoding.

The following classes and extension method do the job:

public static partial class TextExtensions
{
    public static Encoding PlatformCompatibleUnicode => BitConverter.IsLittleEndian ? Encoding.Unicode : Encoding.BigEndianUnicode;
    static bool IsPlatformCompatibleUnicode(this Encoding encoding) => BitConverter.IsLittleEndian ? encoding.CodePage == 1200 : encoding.CodePage == 1201;
    
    public static Stream AsStream(this string @string, Encoding encoding = default) => 
        (@string ?? throw new ArgumentNullException(nameof(@string))).AsMemory().AsStream(encoding);
    public static Stream AsStream(this ReadOnlyMemory<char> charBuffer, Encoding encoding = default) =>
        ((encoding ??= Encoding.UTF8).IsPlatformCompatibleUnicode())
            ? new UnicodeStream(charBuffer)
            : Encoding.CreateTranscodingStream(new UnicodeStream(charBuffer), PlatformCompatibleUnicode, encoding, false);
}

sealed class UnicodeStream : Stream
{
    const int BytesPerChar = 2;

    // By sealing UnicodeStream we avoid a lot of the complexity of MemoryStream.
    ReadOnlyMemory<char> charMemory;
    int position = 0;
    Task<int> _cachedResultTask; // For async reads, avoid allocating a Task.FromResult<int>(nRead) every time we read.

    public UnicodeStream(string @string) : this((@string ?? throw new ArgumentNullException(nameof(@string))).AsMemory()) { }
    public UnicodeStream(ReadOnlyMemory<char> charMemory) => this.charMemory = charMemory;

    public override int Read(Span<byte> buffer)
    {
        EnsureOpen();
        var charPosition = position / BytesPerChar;
        // MemoryMarshal.AsBytes will throw on strings longer than int.MaxValue / 2, so only slice what we need. 
        var byteSlice = MemoryMarshal.AsBytes(charMemory.Slice(charPosition, Math.Min(charMemory.Length - charPosition, 1 + buffer.Length / BytesPerChar)).Span);
        var slicePosition = position % BytesPerChar;
        var nRead = Math.Min(buffer.Length, byteSlice.Length - slicePosition);
        byteSlice.Slice(slicePosition, nRead).CopyTo(buffer);
        position += nRead;
        return nRead;
    }

    public override int Read(byte[] buffer, int offset, int count) 
    {
        ValidateBufferArgs(buffer, offset, count);
        return Read(buffer.AsSpan(offset, count));
    }

    public override int ReadByte()
    {
        // Could be optimized.
        Span<byte> span = stackalloc byte[1];
        return Read(span) == 0 ? -1 : span[0];
    }

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        EnsureOpen();
        if (cancellationToken.IsCancellationRequested) 
            return ValueTask.FromCanceled<int>(cancellationToken);
        try
        {
            return new ValueTask<int>(Read(buffer.Span));
        }
        catch (Exception exception)
        {
            return ValueTask.FromException<int>(exception);
        }   
    }
    
    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        ValidateBufferArgs(buffer, offset, count);
        var valueTask = ReadAsync(buffer.AsMemory(offset, count));
        if (!valueTask.IsCompletedSuccessfully)
            return valueTask.AsTask();
        var lastResultTask = _cachedResultTask;
        return (lastResultTask != null && lastResultTask.Result == valueTask.Result) ? lastResultTask : (_cachedResultTask = Task.FromResult<int>(valueTask.Result));
    }

    void EnsureOpen()
    {
        if (position == -1)
            throw new ObjectDisposedException(GetType().Name);
    }
    
    // https://learn.microsoft.com/en-us/dotnet/api/system.io.stream.flush?view=net-5.0
    // In a class derived from Stream that doesn't support writing, Flush is typically implemented as an empty method to ensure full compatibility with other Stream types since it's valid to flush a read-only stream.
    public override void Flush() { }
    public override Task FlushAsync(CancellationToken cancellationToken) => cancellationToken.IsCancellationRequested ? Task.FromCanceled(cancellationToken) : Task.CompletedTask;
    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) =>  throw new NotSupportedException();
    
    protected override void Dispose(bool disposing)
    {
        try 
        {
            if (disposing) 
            {
                _cachedResultTask = null;
                charMemory = default;
                position = -1;
            }
        }
        finally 
        {
            base.Dispose(disposing);
        }
    }   
    
    static void ValidateBufferArgs(byte[] buffer, int offset, int count)
    {
        if (buffer == null)
            throw new ArgumentNullException(nameof(buffer));
        if (offset < 0 || count < 0)
            throw new ArgumentOutOfRangeException();
        if (count > buffer.Length - offset)
            throw new ArgumentException();
    }
}

Notes:

You can stream from either a string, a char [] array, or slices thereof by converting them to ReadOnlyMemory<char> buffers. This conversion simply wraps the underlying string or array memory without allocating anything.
Solutions that use Encoding.GetBytes() to encode chunks of a string are broken because they will not handle surrogate pairs that are split between chunks. To handle surrogate pairs correctly, Encoding.GetEncoder() must be called to initially save a Encoder. Later, Encoder.GetBytes(ReadOnlySpan<Char>, Span<Byte>, flush: false) can be used to encode in chucks and remember state between calls.

(Microsoft's TranscodingStream does this correctly.)
You will get the best performance by using Encoding.Unicode as (on almost all .Net platforms) this encoding is identical to the encoding of the String type itself.

When a platform-compatible Unicode encoding is supplied no TranscodingStream is used and the returned Stream reads from the character data buffer directly.
To do:
- Test on big-endian platforms (which are rare).
- Test on strings longer than int.MaxValue / 2.

Demo fiddle including some basic tests here.

C.Evenhuis · Answer 3 · 2014-10-02T19:59:32.750

1

You can prevent having to maintain a copy of the whole thing, but you would be forced to use an encoding that results in the same number of bytes for each character. That way you could provide chunks of data via Encoding.GetBytes(str, strIndex, byteCount, byte[], byteIndex) as they're being requested straight into the read buffer.

There will always be one copy action per Stream.Read() operation, because it lets the caller provide the destination buffer.

edited Oct 02 '14 at 19:59

answered Oct 02 '14 at 19:37

C.Evenhuis

25,996
2
58
72

1

"If the data to be converted is available only in sequential blocks (such as data read from a stream) or if the amount of data is so large that it needs to be divided into smaller blocks, the application should use the Decoder or the Encoder provided by the GetDecoder method or the GetEncoder method, respectively, of a derived class." – Ben Voigt Oct 02 '14 at 19:40

Peter Ritchie · Answer 4 · 2014-10-02T20:18:18.440

Stream can only copy data. In addition, it deals in bytes, not chars so you'll have to copy data via the decoding process. But, If you want to view a string as a stream of ASCII bytes, you could create a class that implements Stream to do it. For example:

public class ReadOnlyStreamStringWrapper : Stream
{
    private readonly string theString;

    public ReadOnlyStreamStringWrapper(string theString)
    {
        this.theString = theString;
    }

    public override void Flush()
    {
        throw new NotSupportedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        switch (origin)
        {
            case SeekOrigin.Begin:
                if(offset < 0 || offset >= theString.Length)
                    throw new InvalidOperationException();

                Position = offset;
                break;
            case SeekOrigin.Current:
                if ((Position + offset) < 0)
                    throw new InvalidOperationException();
                if ((Position + offset) >= theString.Length)
                    throw new InvalidOperationException();

                Position += offset;
                break;
            case SeekOrigin.End:
                if ((theString.Length + offset) < 0)
                    throw new InvalidOperationException();
                if ((theString.Length + offset) >= theString.Length)
                    throw new InvalidOperationException();
                Position = theString.Length + offset;
                break;
        }

        return Position;
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return Encoding.ASCII.GetBytes(theString, (int)Position, count, buffer, offset);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }

    public override bool CanRead
    {
        get { return true; }
    }

    public override bool CanSeek
    {
        get { return true; }
    }

    public override bool CanWrite
    {
        get { return false; }
    }

    public override long Length
    {
        get { return theString.Length; }
    }

    public override long Position { get; set; }
}

But, that's a lot of work to avoid "copying" data...

What would `seek` mean here. For ex, Seek(2, fromstart) would mean seek to byte 2 or to char 2. (BTW: for ex, UTF8 has variable length of chars) — L.B, Oct 02 '14 at 20:08
@L.B as I detailed in the answer, *if you want to view a string as a stream of UTF-8 bytes*; which means it would be `byte`s, not `char`s and thus byte 2. — Peter Ritchie, Oct 02 '14 at 20:10
For ex, this throws exception `var stream = new ReadOnlyStreamStringWrapper("Ça"); var buf = new byte[1]; var read = stream.Read(buf, 0, 1);` — L.B, Oct 02 '14 at 20:14
"A lot of work".. I guess that depends on how big the string is or how many concurrent operations there are going on. I think you can make this work independently of encoding types by using the GetByteCount function to calculate the actual length in bytes of the string. — Jeremy, Oct 02 '14 at 21:03
@Jeremy What about to write it yourself instead of instructing people how to do it? `Is there a way to do this without writing new stream classes?` I think you know there isn't better way of doing it and just want someone to do it for you — L.B, Oct 02 '14 at 21:14

Reading string as a stream without copying

4 Answers4

Linked

Related