21

I have multiple text files that I need to read and combine into one file. The files vary in size from 1 to 50 MB each. What's the most efficient way to combine these files without running into the dreaded System.OutOfMemoryException?

Soner Gönül
  • 97,193
  • 102
  • 206
  • 364
Dave Harding
  • 1,280
  • 2
  • 16
  • 31
  • 1
    Can you describe 'Combine'? And what is in those files? Just lines of text or CSV or XML or ... – H H Jun 10 '11 at 19:49
  • What kind of combining are you needing to do? If you're just, say, merge-sorting a bunch of sorted files, you won't need to read the whole files into memory, but can just process them line-by-line. – C. K. Young Jun 10 '11 at 19:49
  • 5
    from a command prompt: copy *.txt targetfile.text – Muad'Dib Jun 10 '11 at 19:50
  • 1
    Yeah... copy file1.txt + file2.txt + file3.txt allfiles.txt – agent-j Jun 10 '11 at 20:05
  • There's a previous discussion of this topic here http://stackoverflow.com/questions/444309/what-would-be-the-fastest-way-to-concatenate-three-files-in-c. Looks like that's a nice approach that will not use as much RAM as looping `ReadAllText` then `WriteAllText`. – Steve Townsend Jun 10 '11 at 20:12
  • 4
    `copy *.txt allfiles.txt` – Lee Englestone Feb 04 '13 at 09:17
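
For completeness, a minimal sketch of the command-prompt approach from the comments above, invoked from C# (the file names are placeholders, not from the question; cmd's copy does the concatenation):

using System.Diagnostics;

// Hypothetical file names; cmd concatenates file1..file3 into allfiles.txt.
var psi = new ProcessStartInfo("cmd.exe", "/c copy file1.txt+file2.txt+file3.txt allfiles.txt");
psi.UseShellExecute = false;
using (var process = Process.Start(psi))
{
    process.WaitForExit(); // block until the copy finishes
}

For non-text data, cmd's copy /b (binary mode) avoids ASCII-mode end-of-file handling.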

4 Answers

25

Do it in chunks:

const int chunkSize = 2 * 1024; // 2KB read buffer
var inputFiles = new[] { "file1.dat", "file2.dat", "file3.dat" };
using (var output = File.Create("output.dat"))
{
    foreach (var file in inputFiles)
    {
        using (var input = File.OpenRead(file))
        {
            var buffer = new byte[chunkSize];
            int bytesRead;
            // Read a chunk at a time and write only the bytes actually read,
            // so at most 2KB of file data is held in memory at once.
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, bytesRead);
            }
        }
    }
}
Darin Dimitrov
  • 1,023,142
  • 271
  • 3,287
  • 2,928
  • I have to run to a meeting and might not be able to test for a bit, but I'll get back to you ASAP! - Thanks – Dave Harding Jun 10 '11 at 19:52
  • 1
    The repeated reallocation of, and data copy to, `actual` is redundant. Just write out the number of bytes you know you read (per `bytesread`) directly from `buffer` to the new file. `buffer` itself also only needs to be allocated once, before entering the outer `for` loop. – Steve Townsend Jun 10 '11 at 20:17
  • @Steve Townsend, very good point. I've updated my post to take it into account. – Darin Dimitrov Jun 10 '11 at 21:52
  • Darin, thanks. Much appreciated. 10 files and it doesn't even break a sweat. – Dave Harding Jun 13 '11 at 16:56
  • @DarinDimitrov does this handle unicode files too? what if two files are in a different format? – Baz1nga May 11 '12 at 12:48
  • @Baz1nga It copies things as is, so the encoding doesn't matter. If the files have different encodings though, the resulting file will not be properly displayed by a normal editor. – user276648 Mar 30 '17 at 04:01
  • @DarinDimitrov Slight improvement: do `new byte[chunkSize]` only once instead of for each file, and use `chunkSize` instead of `buffer.length`. – user276648 Mar 30 '17 at 04:02
  • @DarinDimitrov would merging it in memory in parallel (using memory stream) and then writing to the disk, make it any faster? – loneshark99 May 03 '17 at 23:55
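
For reference, a minimal sketch of the variant user276648 suggests in the comments above: the buffer is allocated once and reused for every file, and chunkSize is used directly (file names are the same placeholders as in the answer):

const int chunkSize = 2 * 1024; // 2KB
var buffer = new byte[chunkSize]; // allocated once, reused across all files
var inputFiles = new[] { "file1.dat", "file2.dat", "file3.dat" };
using (var output = File.Create("output.dat"))
{
    foreach (var file in inputFiles)
    {
        using (var input = File.OpenRead(file))
        {
            int bytesRead;
            while ((bytesRead = input.Read(buffer, 0, chunkSize)) > 0)
            {
                output.Write(buffer, 0, bytesRead);
            }
        }
    }
}
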
23

Darin is on the right track. My tweak would be:

using (var output = File.Create("output"))
{
    foreach (var file in new[] { "file1", "file2" })
    {
        using (var input = File.OpenRead(file))
        {
            // CopyTo streams the bytes in fixed-size chunks internally,
            // so whole files are never loaded into memory.
            input.CopyTo(output);
        }
    }
}
n8wrl
  • 19,439
  • 4
  • 63
  • 103
  • `CopyTo` is a nice one but it's probably worth mentioning that it's only available in .NET 4.0. – Darin Dimitrov Jun 10 '11 at 19:55
  • Oooo - didn't know that. My MSDN defaults to .NET 4 – n8wrl Jun 10 '11 at 19:55
  • how do we get the files back from the combined file? – KADEM Mohammed Dec 06 '13 at 12:52
  • 1
    @Carter: Could you clarify? The original files still exist – n8wrl Dec 07 '13 at 13:24
  • Yes, in my case I have two files, "file.Docx" and "file_Information.Xml". I want application A, for example, to merge the two into one single file, "file.QAF", then pass this file to another application B to recover the two files "file.Docx" and "file_Information.Xml" (the way back...) – KADEM Mohammed Dec 07 '13 at 15:29
  • @CarterNolan just make a note of the length of each file (i.e. input.Length) and then pass it on to application B. Inside application B when writing with FileStream.Write, set offset to the starting byte of each file and count to the number of bytes to write. – user797717 Jul 20 '15 at 00:11
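
To make user797717's suggestion concrete, here is a minimal sketch assuming a simple length-prefixed format; the 8-byte length written before each file's bytes, and the Packer/Pack/Unpack names, are assumptions for illustration, not part of the thread:

using System;
using System.IO;

static class Packer
{
    // Application A: write each file's 8-byte length, then its raw bytes.
    public static void Pack(string[] inputs, string packedPath)
    {
        using (var writer = new BinaryWriter(File.Create(packedPath)))
        {
            foreach (var file in inputs)
            {
                using (var input = File.OpenRead(file))
                {
                    writer.Write(input.Length);      // length prefix (assumed format)
                    input.CopyTo(writer.BaseStream); // then the bytes themselves
                }
            }
        }
    }

    // Application B: read each length, then copy exactly that many bytes back out.
    public static void Unpack(string packedPath, string[] outputs)
    {
        using (var reader = new BinaryReader(File.OpenRead(packedPath)))
        {
            var buffer = new byte[4096];
            foreach (var file in outputs)
            {
                long remaining = reader.ReadInt64();
                using (var output = File.Create(file))
                {
                    while (remaining > 0)
                    {
                        int count = reader.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                        if (count <= 0) throw new EndOfStreamException("Packed file is truncated.");
                        output.Write(buffer, 0, count);
                        remaining -= count;
                    }
                }
            }
        }
    }
}

For the example above, application A would call Packer.Pack(new[] { "file.Docx", "file_Information.Xml" }, "file.QAF") and application B would call Unpack with the same file list to recover the originals.
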
1

This does the same as the code above (which needs .NET 4.0), but is compatible with .NET 2.0 (for text files):

using (var output = new StreamWriter("D:\\TMP\\output"))
{
  foreach (var file in Directory.GetFiles("D:\\TMP", "*.*"))
  {
    // Skip the output file itself; it lives in the directory being scanned.
    if (file == "D:\\TMP\\output") continue;

    using (var input = new StreamReader(file))
    {
      output.WriteLine(input.ReadToEnd());
    }
  }
}

Please note that this reads each entire file into memory at once, so large files will use a lot of memory (and if not enough memory is available, it may fail altogether).
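
If memory is a concern on .NET 2.0 (where Stream.CopyTo is unavailable), a minimal sketch of a line-by-line variant; the separate D:\\TMP\\input folder is an assumption here, so the output file is never re-read as an input:

using (var output = new StreamWriter("D:\\TMP\\output"))
{
  foreach (var file in Directory.GetFiles("D:\\TMP\\input", "*.*"))
  {
    using (var input = new StreamReader(file))
    {
      string line;
      while ((line = input.ReadLine()) != null)
      {
        output.WriteLine(line); // only one line held in memory at a time
      }
    }
  }
}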

Jasper
  • 11,590
  • 6
  • 38
  • 55
1
copy *.txt <combined_fileName>.txt

I also think this is the best approach. It combined 450+ files within 3 hours, and with Excel I removed unwanted records like file headers and footers.

yacc
  • 2,915
  • 4
  • 19
  • 33
Abhijit
  • 11
  • 1