I'm writing large amounts of text data to a FileStream incrementally, and it's rather slow. Would it be faster if I were to instead write the text to a StringBuilder in memory, and then dump the StringBuilder wholesale to the FileStream? I'm thinking that doing so might be able to take advantage of some sort of buffering in the FileStream, but I don't know enough about the workings of the FileStream to be able to judge.

ekolis
  • What is 'large amounts of text'? What is slow? Without the code it is a guess... – rene Apr 26 '14 at 20:54
  • Large amounts of text, as in megabytes. Writing the text is slow, but I'm not sure if it's due to the file I/O itself, any overhead from writing it piecemeal, or the data processing that generates the text (it's a serialization routine). – ekolis Apr 26 '14 at 20:56
  • Attach a profiler and measure which path is the hottest, then edit your question to include that. Since you mention serialization, a small code snippet that demonstrates the issue would help. Also, megabytes is not a 'large amount of text' for me. If you are writing to a local disk, that should take milliseconds. How slow is slow? – rene Apr 26 '14 at 21:03
  • I'm not sure how I can pare this down to a small code snippet - I'm suspecting the problem is mainly with the serialization logic (which uses reflection, or something akin to it), but I wanted to see if I could improve performance in other ways as well. – ekolis Apr 27 '14 at 21:25
  • Fine, I give up. Now why won't the site let me downvote myself... – ekolis Apr 28 '14 at 22:30
  • Well, I don't. That is too easy. You left a valuable comment under your current answer, so we are making some progress. You are not using one of the stock serializers? What does your object graph look like? What does your target format look like? – rene Apr 29 '14 at 07:38
  • The object graph is... basically a mess. Got a lot of circular references, so I'm doing reference tracking. The output format is a custom text format - I was going to make it JSON, but had trouble with making that work (and the standard JSON.NET serializer choked on a few things), so I came up with my own format. – ekolis May 02 '14 at 21:16

2 Answers

As we don't have your serialization code, and you suspect you can gain performance by using an optimized Stream and/or a StringBuilder, I set up a test rig in LINQPad that builds a list of instances of a class with one long string property and a few other properties. That list is then serialized to disk.
The XML output is 115,910,381 bytes (about 110 MB) on disk.

Test Rig

void Main()
{
    var list = new List<Test>();
    for(int k=0;k<100;k++) list.Add(
        new Test { Prop1 = Rnd(), /* random string of 1 MB */
                 Prop2 =k, Prop3=k*k, Prop4= DateTime.Now});
    BinaryFormatter(list);
    DataContractJsonSerializer(list);
    XmlSerializer(list);
    XmlSerializerBuffered(list);
    XmlSerializerMemory(list);
    XmlSerializerStringBuilder(list);
}

As the XML serializer took the most time, I decided to try the different techniques only on that variant.

Direct FileStream

void XmlSerializer(List<Test> list)
{
    var sw = new Stopwatch();
    sw.Start();
    var s = new FileStream("c:\\temp\\test.xml", FileMode.Create);
    var x = new XmlSerializer(typeof(List<Test>));
    x.Serialize(s,list);
    s.Close();
    sw.Stop();
    
    sw.Elapsed.Dump("Xml");
}

Buffered Stream

void XmlSerializerBuffered(List<Test> list)
{
    var sw = new Stopwatch();
    sw.Start();
    var s = new FileStream("c:\\temp\\test.xmlbuf", FileMode.Create);
    var b = new BufferedStream(s);
    var x = new XmlSerializer(typeof(List<Test>));
    x.Serialize(b,list);
    b.Close(); // flushes the buffer and also closes the underlying FileStream
    sw.Stop();
    
    sw.Elapsed.Dump("Xml Buffered");
}

First in MemoryStream, then copy

void XmlSerializerMemory(List<Test> list)
{
    var sw = new Stopwatch();
    sw.Start();
    var s = new FileStream("c:\\temp\\test.xmlmem", FileMode.Create);
    var m = new MemoryStream(1024*1024);  // initial capacity only; the stream can and will grow
    // also works, but is slower: var m = new MemoryStream();
    var x = new XmlSerializer(typeof(List<Test>));
    x.Serialize(m,list);
    
    m.Position=0;
    m.CopyTo(s);
    m.Close();
    s.Close();
    sw.Stop();
    
    sw.Elapsed.Dump("Xml Mem");
}

StringBuilder

void XmlSerializerStringBuilder(List<Test> list)
{
    var sw = new Stopwatch();
    sw.Start();
    var s = new StreamWriter("c:\\temp\\test.xmlsb");
    var sb = new StringBuilder();
    var m = new StringWriter(sb);
    var x = new XmlSerializer(typeof(List<Test>));
    x.Serialize(m,list);
    s.Write(sb.ToString()); // http://stackoverflow.com/a/5027483/578411
    s.Close();
    sw.Stop();
    sw.Elapsed.Dump("Xml StringBuilder");
}

My Results (on Win7 64-bit / .NET 4.0 / x86 / 4 GB / RAID 0+1)

A typical outcome looked like this:

Xml                00:00:01.5116768 
Xml Buffered       00:00:01.3149263 
Xml Mem            00:00:01.2465760 
Xml StringBuilder  00:00:02.1440784 

The variant where all data is first written to a MemoryStream and then copied to the FileStream in one go was always the fastest.
The StringBuilder variant was always the slowest, but XmlSerializer has no overload that writes directly to a StringBuilder. Hence the StringWriter as an extra indirection, and that takes time.

Keep in mind that this is just ugly, non-optimized test code, meant only to give an idea of what might work. Only optimize based on actual performance data from your setup with your data. Change one thing at a time and keep measuring.

Data Class

[Serializable]
public class Test
{
     public string Prop1 {get; set;}
     public int Prop2 {get;set;}
     public double Prop3 {get;set;}
     public DateTime Prop4 {get;set;}
     
}
rene
  • Thanks, that was really helpful! I see that (at least in your scenario) using a StringBuilder actually made the problem *worse*. I suppose with a MemoryStream, you have to know the size of the data beforehand, which I won't, so I won't be able to use that. But a BufferedStream looks like it might be worthwhile to use... – ekolis May 02 '14 at 21:09
  • @ekolis Your assumption about the MemoryStream is wrong. It grows dynamically. I gave it a large initial buffer so that it didn't need to grow the internal buffer right from the start. I would personally give the MemoryStream the first try. I added that to my answer. – rene May 03 '14 at 06:42
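
To illustrate the point about MemoryStream made in the comment above, here is a minimal sketch that is not part of the original answer. It assumes a deliberately tiny initial capacity and a hypothetical output file name in the temp directory. The capacity passed to the constructor is only a starting size, so the total size of the data does not need to be known up front.

```csharp
using System;
using System.IO;
using System.Text;

class MemoryStreamGrowthDemo
{
    static void Main()
    {
        // Deliberately tiny initial capacity (16 bytes): it is only a starting size.
        using (var m = new MemoryStream(16))
        {
            Console.WriteLine("Initial capacity: " + m.Capacity);

            // Write far more than 16 bytes; the internal buffer is reallocated as needed.
            byte[] chunk = Encoding.UTF8.GetBytes("some serialized text\n");
            for (int i = 0; i < 1000; i++)
            {
                m.Write(chunk, 0, chunk.Length);
            }

            Console.WriteLine("Length after writes: " + m.Length);
            Console.WriteLine("Capacity after writes: " + m.Capacity);

            // Copy everything to disk in one go, as in XmlSerializerMemory.
            // The file name is a hypothetical choice for this sketch.
            string path = Path.Combine(Path.GetTempPath(), "memorystream-demo.txt");
            using (var s = new FileStream(path, FileMode.Create))
            {
                m.Position = 0;
                m.CopyTo(s);
            }
        }
    }
}
```

A larger initial capacity, as in the test rig, simply avoids some of the early reallocations. Whether that matters for your data is something to measure.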

StringBuilder might help you reduce the number of strings that are created. If you write a lot of unique strings to your FileStream, creating them can add real overhead. It is the same discussion as for string concatenation (see Most efficient way to concatenate strings?). In that case, the bad performance you perceive might not be caused by the FileStream at all.
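
As a rough sketch of the difference, assuming a simple loop that appends numbers (the method names `Concatenated` and `Built` are made up for this example): repeated `+=` allocates a new string on every iteration, while a StringBuilder appends into a growing buffer and creates the final string once.

```csharp
using System;
using System.Text;

class ConcatenationDemo
{
    // Every += creates a brand-new string containing everything built so far.
    static string Concatenated(int count)
    {
        string s = "";
        for (int i = 0; i < count; i++)
        {
            s += i + ";";
        }
        return s;
    }

    // StringBuilder appends into an internal buffer; one string is created at the end.
    static string Built(int count)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            sb.Append(i).Append(';');
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Both approaches produce the same text; only the allocation pattern differs.
        Console.WriteLine(Concatenated(1000) == Built(1000)); // prints "True"
    }
}
```

Whether this is the actual bottleneck in your serializer is exactly what a profiler should tell you.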

Best thing to do? Run a Performance Analysis tool.

Richard
  • I've tried running a performance analysis tool, but the results it gives me are pretty useless, because the serialization code is recursively parsing the object graph, so it just shows a function calling itself a lot. I'm guessing since there aren't many other things it's calling that the function itself is slow, but I'm not entirely sure... – ekolis Apr 27 '14 at 21:26