28

We have code like:

Dim ms As New IO.MemoryStream()
Dim bin As New System.Runtime.Serialization.Formatters.Binary.BinaryFormatter()
bin.Serialize(ms, largeGraphOfObjects)
Dim dataToSaveToDatabase As Byte() = ms.ToArray()
' Put dataToSaveToDatabase into a SQL Server BLOB

But the MemoryStream allocates a large buffer on the large object heap, and that is giving us problems. So how can we stream the data without needing enough free memory to hold the serialized objects?

I am looking for a way to get a Stream from SQL Server that can then be passed to bin.Serialize(), so avoiding keeping all the data in my process's memory.

Likewise for reading the data back...


Some more background.

This is part of a complex numerical processing system that processes data in near real time, looking for equipment problems etc. The serialization is done to allow a restart when there is a problem with the data quality from a data feed. (We store the data feeds and can rerun them after the operator has edited out bad values.)

Therefore we serialize the objects a lot more often than we de-serialize them.

The objects we are serializing include very large arrays, mostly of doubles, as well as a lot of small “more normal” objects. We are pushing the memory limit on 32-bit systems and making the garbage collector work very hard. (Efforts are being made elsewhere in the system to improve this, e.g. reusing large arrays rather than creating new ones.)

Often the serialization of the state is the last straw that causes an out of memory exception; the peak of our memory usage is always during this serialization step.

I think we get large object heap fragmentation when we de-serialize the objects, and I expect there are also other large object heap fragmentation problems given the size of the arrays. (This has not yet been investigated, as the person who first looked at this is a numerical processing expert, not a memory management expert.)

Our customers use a mix of SQL Server 2000, 2005 and 2008 and we would rather not have different code paths for each version of SQL Server if possible.

We can have many active models at a time (in different processes, across many machines); each model can have many saved states. Hence the saved state is stored in a database BLOB rather than a file.

As the speed of saving the state is important, I would rather not serialize the object to a file and then put the file into a BLOB one block at a time.


Ian Ringrose
  • "But the memory steam allocates a large buffer from the large memory heap that is giving us problems" - could you expand upon that. What problems? out of memory etc.? – Mitch Wheat Jan 23 '10 at 06:27
  • @Mitch, we are running out of memory when serializing the object, we *think* we are also getting problems with large object heap fragmentation when de-serializing the objects. Some of the objects are VERY large arrays. – Ian Ringrose Jan 23 '10 at 11:21

7 Answers

40

There is no built-in ADO.NET functionality to handle this really gracefully for large data. The problem is twofold:

  • there is no API to 'write' into a SQL command(s) or parameters as into a stream. The parameter types that accept a stream (like FileStream) accept the stream to READ from it, which does not agree with the serialization semantics of writing into a stream. No matter which way you turn this, you end up with an in-memory copy of the entire serialized object, which is bad.
  • even if the point above were solved (and it cannot be), the TDS protocol and the way SQL Server accepts parameters do not work well with large parameters, as the entire request has to be received before it is launched into execution, and this would create additional copies of the object inside SQL Server.

So you really have to approach this from a different angle. Fortunately, there is a fairly easy solution. The trick is to use the highly efficient UPDATE .WRITE syntax and pass in the chunks of data one by one, in a series of T-SQL statements. This is the MSDN recommended way, see Modifying Large-Value (max) Data in ADO.NET. This looks complicated, but is actually trivial to do and plug into a Stream class.


The BlobStream class

This is the bread and butter of the solution: a Stream-derived class that implements the Write method as a call to the T-SQL BLOB WRITE syntax. Straightforward; the only interesting thing about it is that it has to keep track of the first update, because the UPDATE ... SET blob.WRITE(...) syntax would fail on a NULL field:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class BlobStream : Stream
{
    private SqlCommand cmdAppendChunk;
    private SqlCommand cmdFirstChunk;
    private SqlConnection connection;
    private SqlTransaction transaction;

    private SqlParameter paramChunk;
    private SqlParameter paramLength;

    private long offset;

    public BlobStream(
        SqlConnection connection,
        SqlTransaction transaction,
        string schemaName,
        string tableName,
        string blobColumn,
        string keyColumn,
        object keyValue)
    {
        this.transaction = transaction;
        this.connection = connection;
        cmdFirstChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}] = @firstChunk
    WHERE [{3}] = @key"
            ,schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdFirstChunk.Parameters.AddWithValue("@key", keyValue);
        cmdAppendChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}].WRITE(@chunk, NULL, NULL)
    WHERE [{3}] = @key"
            , schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdAppendChunk.Parameters.AddWithValue("@key", keyValue);
        paramChunk = new SqlParameter("@chunk", SqlDbType.VarBinary, -1);
        cmdAppendChunk.Parameters.Add(paramChunk);
    }

    public override void Write(byte[] buffer, int index, int count)
    {
        byte[] bytesToWrite = buffer;
        if (index != 0 || count != buffer.Length)
        {
            bytesToWrite = new MemoryStream(buffer, index, count).ToArray();
        }
        if (offset == 0)
        {
            cmdFirstChunk.Parameters.AddWithValue("@firstChunk", bytesToWrite);
            cmdFirstChunk.ExecuteNonQuery();
            offset = count;
        }
        else
        {
            paramChunk.Value = bytesToWrite;
            cmdAppendChunk.ExecuteNonQuery();
            offset += count;
        }
    }

    // Rest of the abstract Stream implementation
 }

Using the BlobStream

To use this newly created blob stream class you plug it into a BufferedStream. The class has a trivial design that handles only writing the stream into a column of a table. I'll reuse a table from another example:

CREATE TABLE [dbo].[Uploads](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [FileName] [varchar](256) NULL,
    [ContentType] [varchar](256) NULL,
    [FileData] [varbinary](max) NULL)

I'll add a dummy object to be serialized:

[Serializable]
class HugeSerialized
{
    public byte[] theBigArray { get; set; }
}

Finally, the actual serialization. We'll first insert a new record into the Uploads table, then create a BlobStream on the newly inserted Id and call the serialization straight into this stream:

using (SqlConnection conn = new SqlConnection(Settings.Default.connString))
{
    conn.Open();
    using (SqlTransaction trn = conn.BeginTransaction())
    {
        SqlCommand cmdInsert = new SqlCommand(
@"INSERT INTO dbo.Uploads (FileName, ContentType)
VALUES (@fileName, @contentType);
SET @id = SCOPE_IDENTITY();", conn, trn);
        cmdInsert.Parameters.AddWithValue("@fileName", "Demo");
        cmdInsert.Parameters.AddWithValue("@contentType", "application/octet-stream");
        SqlParameter paramId = new SqlParameter("@id", SqlDbType.Int);
        paramId.Direction = ParameterDirection.Output;
        cmdInsert.Parameters.Add(paramId);
        cmdInsert.ExecuteNonQuery();

        BlobStream blob = new BlobStream(
            conn, trn, "dbo", "Uploads", "FileData", "Id", paramId.Value);
        BufferedStream bufferedBlob = new BufferedStream(blob, 8040);

        HugeSerialized big = new HugeSerialized { theBigArray = new byte[1024 * 1024] };
        BinaryFormatter bf = new BinaryFormatter();
        bf.Serialize(bufferedBlob, big);
        bufferedBlob.Flush(); // push the last partially filled buffer to the BlobStream before committing

        trn.Commit();
    }
}

If you monitor the execution of this simple sample you'll see that nowhere is a large serialization stream created. The sample will allocate the array of [1024*1024] but that is for demo purposes to have something to serialize. This code serializes in a buffered manner, chunk by chunk, using the SQL Server BLOB recommended update size of 8040 bytes at a time.
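
For the "reading the data back" part of the question, the same chunked idea works in reverse. Here is a rough sketch of a minimal read-only Stream over SqlDataReader.GetBytes (executed with CommandBehavior.SequentialAccess so the BLOB is never buffered client-side) that BinaryFormatter.Deserialize can pull from chunk by chunk. The Uploads table and connection string are reused from the example above; SqlBlobReaderStream and the id variable are illustrative names only:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// A minimal read-only Stream over SqlDataReader.GetBytes; enough for
// BinaryFormatter.Deserialize, which only ever calls Read on it.
class SqlBlobReaderStream : Stream
{
    private readonly SqlDataReader reader;
    private readonly int column;
    private long fieldOffset;

    public SqlBlobReaderStream(SqlDataReader reader, int column)
    {
        this.reader = reader;
        this.column = column;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // With CommandBehavior.SequentialAccess this pulls the next chunk
        // off the wire instead of materializing the whole BLOB.
        long read = reader.GetBytes(column, fieldOffset, buffer, offset, count);
        fieldOffset += read;
        return (int)read;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override void Flush() { }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { return fieldOffset; }
        set { throw new NotSupportedException(); }
    }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

And the de-serialization call:

using (SqlConnection conn = new SqlConnection(Settings.Default.connString))
{
    conn.Open();
    SqlCommand cmd = new SqlCommand(
        "SELECT FileData FROM dbo.Uploads WHERE Id = @id", conn);
    cmd.Parameters.AddWithValue("@id", id);
    using (SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        if (reader.Read())
        {
            BinaryFormatter bf = new BinaryFormatter();
            object graph = bf.Deserialize(
                new BufferedStream(new SqlBlobReaderStream(reader, 0), 8040));
        }
    }
}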

Remus Rusanu
  • thanks, I did not think of using BufferedStream of doing the hard work of buffering. – Ian Ringrose Jan 28 '10 at 08:02
  • What is the "bytesToWrite = new MemoryStream(buffer, index, count).ToArray();" for? I am missing something or could an array of bytes have been allocated instead? – Ian Ringrose Jan 28 '10 at 08:03
  • the memorystream..toarray() is indeed just metal fatigue on me. You could, and should, allocate a byte[] of the proper size and copy the bytes into it. – Remus Rusanu Jan 28 '10 at 08:17
  • Also the BinaryWriter declaration is an artifact from when I wrote the code, is obviously not necessary. – Remus Rusanu Jan 28 '10 at 08:20
  • Btw Marc is right, the optimal size is 8040 not 8060: http://msdn.microsoft.com/en-us/library/ms177523.aspx "For best performance, we recommend that data be inserted or updated in chunk sizes that are multiples of 8040 bytes" – Remus Rusanu Jan 29 '10 at 23:19
  • 1
    Caution: I've used this technique, and it worked fine for about a year or so, but now it refuses to work (SQL timeout) now that the size of our BLOB table is approx 12GB. The problem seems to be with the way this method causes SQL Server to allocate space in little chunks at a time, forcing it to copy the data around a lot. I'm wondering if there's a command we could issue at the beginning to initialize the blob to the correct length, but filled with zeroes, and then use `UPDATE .WRITE` to fill it in with the real data. Maybe that would fix this problem. Still working on it. Stay posted. – Daniel Schilling Jul 03 '13 at 15:30
  • 1
    Very good answer. Additional note: You can convert a write/push stream to a read/pull stream with the help of an additional thread. The writer would push into a bounded queue of buffers that the pull stream can read from. That results in true streaming. – usr Sep 17 '13 at 13:17
  • 1
    **I found another problem with this solution.** If you are appending the bytes, SQL Server ignores the length parameter (even if offset is always passed), and the complete buffer data is written, although the length/count parameter is smaller! – Juan Calero Jan 27 '14 at 17:14
  • @JuanCalero how could you tell SQL is writing the complete buffer each time? What tool are you using to show this? – JJS Jun 02 '15 at 22:55
  • Well, it's been a long time, I can't remember.., but I suposse I was simply debugging the code and watching the expected and the actual buffer. As a side note, i love LinqPad to run SQL queries and small C# snippets and tests. Lightweight and really powerful tool! – Juan Calero Jun 03 '15 at 08:14
  • @JSS the [`UPDATE.WRITE`](https://msdn.microsoft.com/en-us/library/ms177523.aspx) spec says: `@Length is the length of the section in the column, starting from @Offset, that is replaced by expression.`. Ie. you can replace 30 bytes with a chunk of 20 bytes. So indeed my code was wrong, it should pass NULL as length truncate the byte array. It is fixed now. – Remus Rusanu Jun 03 '15 at 09:07
  • @RemusRusanu your sqlfiddle doesn't have any code, only a schema definition. – JJS Jun 11 '15 at 17:43
  • firstly awesome answer in so many ways. But i am wondering how a similar streaming approach could be applied to an XML data type? is this different enough to be a new question? – Simon Sep 26 '15 at 06:51
  • XML streaming should go the very same way. `using (XmlWriter writer = XmlWriter.Create(blobStream)) { writer.WriteElementString("foo", "bar"); writer.Flush(); writer.Close(); }` – Remus Rusanu Sep 26 '15 at 08:37
11

All you need is .NET Framework 4.5 and streaming. Let's assume we have a big file on HDD and we want to upload this file.

SQL code:

CREATE TABLE BigFiles 
(
    [BigDataID] [int] IDENTITY(1,1) NOT NULL,
    [Data] VARBINARY(MAX) NULL
)

C# code:

using (FileStream sourceStream = new FileStream(filePath, FileMode.Open))
{
    using (SqlCommand cmd = new SqlCommand("UPDATE BigFiles SET Data = @Data WHERE BigDataID = @BigDataID", _sqlConn))
    {
        cmd.Parameters.AddWithValue("@Data", sourceStream);
        cmd.Parameters.AddWithValue("@BigDataID", entryId);

        cmd.ExecuteNonQuery();
    }
}

Works well for me. I have successfully uploaded a 400 MB file, while MemoryStream threw an exception when I tried to load the same file into memory.

UPD: This code works on Windows 7, but failed on Windows XP and 2003 Server.
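
For reading the data back with the same .NET Framework 4.5 streaming support, a rough sketch using SqlDataReader.GetStream together with CommandBehavior.SequentialAccess (BigFiles, _sqlConn and entryId are reused from above; targetPath is an illustrative name):

using (SqlCommand cmd = new SqlCommand("SELECT Data FROM BigFiles WHERE BigDataID = @BigDataID", _sqlConn))
{
    cmd.Parameters.AddWithValue("@BigDataID", entryId);

    using (SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        if (reader.Read())
        {
            // GetStream (new in .NET 4.5) exposes the column as a Stream that is
            // read incrementally from the wire rather than buffered in memory.
            using (Stream data = reader.GetStream(0))
            using (FileStream target = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
            {
                data.CopyTo(target);
            }
        }
    }
}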

  • Is sourceStream supposed to be fs? What is the type of Data column? – user3285954 Apr 13 '15 at 09:24
  • Data is VARBINARY(MAX). sourceStream is equal to fs, sorry for my mistake, will update post – vladimir khozeyev Apr 14 '15 at 09:58
  • Did you validate that data is actually streamed and not read into an array and uploaded in one go? – user3285954 Apr 14 '15 at 15:02
  • I am sure that data is actually streamed. I have just tested this code using avi file of 1.5GB. It was successfully uploaded into DB. There is no way to allocate array of that size. You will get an OutOfMemoryException. – vladimir khozeyev Apr 16 '15 at 06:12
  • 1
    Why is this answer downvoted? According to documentation https://msdn.microsoft.com/en-us/library/hh556234(v=vs.110).aspx you can set SqlParameter.Value to a stream. This is a new feature introduced in .NET Fx 4.5 – Jesús López Apr 16 '15 at 06:33
  • 1
    @vladimirkhozeyev Thank you. Another improvement to post would be to include table schema, i.e. sql script. – user3285954 Apr 16 '15 at 10:05
  • 1
    Be aware that this creates a size-of-file parameter *on the server*. Large files may end up consuming space in `tempdb`. – Remus Rusanu Jun 11 '15 at 19:06
4

You can always write to SQL Server at a lower level using the over-the-wire protocol TDS (Tabular Data Stream) that Microsoft has used since day one. They are unlikely to change it any time soon, as even SQL Azure uses it!

You can see source code showing how this works in the Mono project and in the FreeTDS project.

Check out the tds_blob

TFD
4

What does the graph look like?

One problem here is the stream; the SQL 2005 requirement is a pain, as otherwise you could write directly to SqlFileStream, however, I don't think it would be too hard to write your own Stream implementation that buffers 8040 (or some multiple) bytes and writes it incrementally. However, I'm not sure that it is worth this extra complexity - I would be hugely tempted to just use a file as the scratch buffer and then (once serialized) loop over the file inserting/appending chunks. I don't think that the file system is going to hurt your overall performance here, and it will save you starting to write doomed data - i.e. you don't talk to the database until you already know what data you want to write. It will also help you minimise the time the connection is open.
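
A rough sketch of that scratch-file route, reusing the dbo.Uploads table from the other answer and the 8040-byte chunk size (connString, id and largeGraphOfObjects are placeholders): serialize to a temp file first, then append the file to the BLOB column chunk by chunk with UPDATE ... .WRITE:

// Sketch: serialize to a scratch file, then push it to the database in 8040-byte chunks.
string tempFile = Path.GetTempFileName();
using (FileStream fs = new FileStream(tempFile, FileMode.Create, FileAccess.ReadWrite))
{
    new BinaryFormatter().Serialize(fs, largeGraphOfObjects);
    fs.Position = 0;

    using (SqlConnection conn = new SqlConnection(connString))
    {
        conn.Open();

        // Start from an empty (non-NULL) value so that .WRITE can append to it.
        SqlCommand init = new SqlCommand(
            "UPDATE dbo.Uploads SET FileData = 0x WHERE Id = @id", conn);
        init.Parameters.AddWithValue("@id", id);
        init.ExecuteNonQuery();

        SqlCommand append = new SqlCommand(
            "UPDATE dbo.Uploads SET FileData.WRITE(@chunk, NULL, NULL) WHERE Id = @id", conn);
        SqlParameter chunkParam = append.Parameters.Add("@chunk", SqlDbType.VarBinary, -1);
        append.Parameters.AddWithValue("@id", id);

        byte[] buffer = new byte[8040];
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            byte[] chunk = buffer;
            if (read != buffer.Length)
            {
                // Last chunk: trim to the bytes actually read.
                chunk = new byte[read];
                Array.Copy(buffer, chunk, read);
            }
            chunkParam.Value = chunk;
            append.ExecuteNonQuery();
        }
    }
}
File.Delete(tempFile);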

The next problem is the serialization itself. Personally I don't recommend using BinaryFormatter to write to persistent stores (only for transport), since it is implementation specific both in the encoder itself, and in your types (i.e. it is brittle if you make innocent-looking changes to your data types).

If your data can be represented sufficiently as a tree (rather than a full graph), I would be very tempted to try protocol buffers / protobuf-net. This encoding (devised by Google) is smaller than the BinaryFormatter output, faster both for read and write, and is contract-based rather than field-based, so you can reliably rehydrate it again later (even if you switch platform entirely).

The default options mean that it has to write the object-length before each object (which might be expensive in your case), but if you have nested lists of large (deep) objects you can use grouped encoding to avoid this need - allowing it to write the stream in a forwards-only, single-pass way; here's a brief simple example using grouped encoding, but if you want to throw a more complex scenario at me, just let me know...

using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;
[ProtoContract]
public class Foo {
    private readonly List<Bar> bars = new List<Bar>();
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<Bar> Bars { get { return bars;}}
}
[ProtoContract]
public class Bar {
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
}
static class Program {
    static void Main() {
        var obj = new Foo { Bars = {
            new Bar { Id = 123, Name = "abc"},
            new Bar { Id = 456, Name = "def"},
        } };
        // write it and show it
        using (MemoryStream ms = new MemoryStream()) {
            Serializer.Serialize(ms, obj);
            Console.WriteLine(BitConverter.ToString(ms.ToArray()));
        }
    }
}

Note: I do have some theories on how to hack Google's wire format to support full graphs, but it is going to need some time to try it. Oh, re the "very large arrays" - for primitive types (not objects) you can use "packed" encoding for this; [DataMember(..., Options = MemberSerializationOptions.Packed)] - might be useful, but hard to say without visibility of your model.
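
As a rough illustration of that packed option on one of those large double arrays (ModelState and Samples are made-up names; the exact attribute spelling depends on the protobuf-net version, newer builds expose IsPacked on ProtoMember):

[ProtoContract]
public class ModelState {
    // "packed" encoding: the doubles are written as one length-prefixed run
    // rather than a field header per element.
    [ProtoMember(1, IsPacked = true)]
    public double[] Samples { get; set; }
}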

Marc Gravell
  • At least **we** don’t get any problems with versioning, as the saved state does not need to be readable by a new version of the software, so we can cope with the fact that BinaryFormatter is brittle. – Ian Ringrose Jan 27 '10 at 07:52
2

Why not implement your own System.IO.Stream-derived class? That would allow you to attach it to the SQL column directly via UpdateText for writing.

eg (pseudo-code)

Insert DB Record with blob column 'initialized' (see above UpdateText article)
Create Your Stream Type / Associate DB connection with the stream
Pass the stream to the serialize call

It could chunk up the calls to it (in multiples of 8040 bytes at a time, I presume) and, on each full buffer, pass that to the DB UpdateText call with the proper offset.

On close of the stream you'd flush whatever was left that didn't fill the buffer entirely via UpdateText.

Likewise you could use the same/similar derived stream to allow reading from a DB column, passing that to be deserialized.

Creating a derived Stream is not all that much work - I've done it in C++/CLI to provide interoperability with IStream - and if I can do it :)... (I can provide you the C++/CLI stream code I've done as a sample if that would be helpful)

If you put the entire operation (Insert of initial row, calls to update the blob via the stream) into a transaction you would avoid any potential db inconsistencies if the serialization step fails.
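
For a sense of how small that derived Stream really is, here is a bare skeleton (the actual UpdateText / UPDATE ... .WRITE call is elided; only the members the serializer actually exercises need real implementations, the rest can simply report themselves as unsupported):

using System;
using System.IO;

// Skeleton only: a forward-only, write-only stream for serialization.
class DbWriteOnlyStream : Stream
{
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Accumulate bytes here and, on each full 8040-byte (or multiple) chunk,
        // issue the UpdateText / UPDATE ... .WRITE call at the current offset.
    }

    public override void Flush()
    {
        // Push any partially filled buffer to the database (called on Close).
    }

    // Members a purely forward serialization pass never needs:
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}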

Ruddy
  • Thanks, I was thinking along this sort of line; however the .NET IO.Stream has a lot of methods like "Seek" that will be hard to implement. I was hoping for a prewritten, well tested (nunit), open source implementation that may use a background thread to write the chunk to the database. – Ian Ringrose Jan 23 '10 at 13:36
  • Yea thats what i thought when i initially investigated this for IStreams - but in reality only a few of the calls are actually used typically - so you can almost certainly get away with punting on the hard ones (like just throwing a not implemented exception) obviously not a general scenario - but you would probably find in your specific case it would work just fine. Serialization is really a unidirectional linear process - and i'm sure you'd find in reality only Read & Write were ever called. – Ruddy Jan 23 '10 at 13:54
  • A `System.IO.Stream` does not have to support seeking (network stream etc.) Thats why there is a `CanSeek` property. (Or do you mean other sorts of streams?) – Christian Klauser Jan 23 '10 at 14:05
  • @SealedSun, but how can you tell from the API to the 3rd party method if it needs seeking support now or if the next version will need seeking support? I just don’t like interfaces with optional methods! – Ian Ringrose Jan 23 '10 at 14:14
  • IIRC the SQL2K5 drivers effectively allocated memory for the entire BLOb anyway (caching for re-read?), so it's a double edged problem. Not sure how SQL2K8 drivers work? Worth checking before you try this track – TFD Jan 24 '10 at 11:16
  • @Ian Ringrose: Well that should be part of the librarys "contract" (documented). It cannot be enforced by the C#/CLR type checker, true, but there will always be things that cannot be guaranteed statically (array bounds etc.) – Christian Klauser Jan 24 '10 at 23:40
  • @IIRC does the same thing happen for a write of a BLOB? Given that ADO.NET does not know how big the BLOB will be until you have written the last chunk, it will find it hard to preallocate the memory. – Ian Ringrose Jan 27 '10 at 07:57
1

I would go with files. Basically use the file system as an intermediary between SQL Server and your application.

  1. When serializing a large object, serialize it into a FileStream.
  2. To import it into the database, instruct the database to use the file directly when saving the data. It would probably look something like this:

    INSERT INTO MyTable ( [MyColumn] ) SELECT b.BulkColumn FROM OPENROWSET(BULK N'C:\Path To My File\File.ext', SINGLE_BLOB) AS b

  3. When reading back the data, instruct SQL to save the big column back into the file system as a temporary file, which you will delete after de-serializing it into memory (no need to delete it immediately, as possible caching can be done here). I am not really sure what the SQL command for that is, as I am for sure no DB expert, but I am pretty sure there must be one.

  4. Then use a FileStream object again to de-serialize it back into memory.

This procedure can be generalized into a helper class, which will know when to delete those temporary files, as you can reuse them if you know for sure that the value of the SQL data record hasn't changed.
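
A minimal sketch of steps 1 and 2, assuming the MyTable/MyColumn names from the query above and a path that the SQL Server service account can read (the file has to be visible to the server, which is the limitation discussed in the comments below; connString and largeGraphOfObjects are placeholders):

// Step 1: serialize straight to a file instead of a MemoryStream.
string path = @"C:\Path To My File\File.ext";
using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.Write))
{
    new BinaryFormatter().Serialize(fs, largeGraphOfObjects);
}

// Step 2: let SQL Server pull the file in as a single BLOB.
// (The OPENROWSET BULK path must be a literal, so it is concatenated here.)
using (SqlConnection conn = new SqlConnection(connString))
{
    conn.Open();
    SqlCommand cmd = new SqlCommand(
        @"INSERT INTO MyTable ([MyColumn])
          SELECT b.BulkColumn
          FROM OPENROWSET(BULK N'" + path + @"', SINGLE_BLOB) AS b", conn);
    cmd.ExecuteNonQuery();
}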

Ivan Zlatanov
  • This will not work, as the database may be on a remote machine and trying to get customers to setup a file share as well is a pain. – Ian Ringrose Jan 24 '10 at 17:08
  • How about using CLR Stored Procedure, to overcome the need for file share? Saving the serialized file in the database is easy, and you only need permissions to write to files somewhere - windows temp files for example? And for reading it back, use CLR Stored Procedure and Stream the temporary file back to your application deserialized? – Ivan Zlatanov Jan 24 '10 at 19:03
1

Note that since SQL Server 2012 there's also FileTable, which is similar to FILESTREAM except that it also allows non-transactional access.

https://msdn.microsoft.com/en-us/library/hh403405.aspx#CompareFileTable

user3285954
  • Please explain why the answer was downvoted. It is a valid answer to the question because it allows large amounts of data to be saved to SQL Server (the only limit is disk space). Data is streamed, therefore memory usage is minimal. Applies to reading too. The OP may have upgraded from SQL Server 2005 since 2010 when the question was posted, and regardless of this the answer is definitely useful for others who are not constrained by SQL Server version. – user3285954 Apr 16 '15 at 10:15