8

I have over 125 TSV files of ~100 MB each that I want to merge. The merge operation is allowed to destroy the 125 files, but not the data. What matters is that at the end, I end up with one big file containing the content of all the files, one after the other (in no specific order).

Is there an efficient way to do that? I was wondering if Windows provides an API to simply make a big "union" of all those files. Otherwise, I will have to read all the files and write one big one.

Thanks!

Martin
  • PS: have a look here (possible duplicate): http://stackoverflow.com/questions/444309/what-would-be-the-fastest-way-to-concatenate-three-files-in-c – Abel Aug 24 '10 at 13:29

4 Answers

17

So "merging" is really just writing the files one after the other? That's pretty straightforward - just open one output stream, and then repeatedly open an input stream, copy the data, close. For example:

static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (Stream output = File.OpenWrite(outputFile))
    {
        // Append each input file to the single output stream in turn.
        foreach (string inputFile in inputFiles)
        {
            using (Stream input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

That's using the Stream.CopyTo method, which is new in .NET 4. If you're not using .NET 4, another helper method would come in handy:

private static void CopyStream(Stream input, Stream output)
{
    // Copy in fixed-size chunks so only one small buffer is held in memory.
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
}
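
On pre-.NET 4 projects, the concatenation loop would then call this helper instead of CopyTo. A minimal sketch (the method name here is my own, not from the original answer):

static void ConcatenateFilesPre40(string outputFile, params string[] inputFiles)
{
    using (Stream output = File.OpenWrite(outputFile))
    {
        foreach (string inputFile in inputFiles)
        {
            using (Stream input = File.OpenRead(inputFile))
            {
                // Manual chunked copy instead of Stream.CopyTo.
                CopyStream(input, output);
            }
        }
    }
}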

There's nothing that I'm aware of that is more efficient than this... but importantly, this won't take up much memory on your system at all. It's not like it's repeatedly reading the whole file into memory then writing it all out again.

EDIT: As pointed out in the comments, there are ways you can fiddle with file options to potentially make it slightly more efficient in terms of what the file system does with the data. But fundamentally you're going to be reading the data and writing it, a buffer at a time, either way.
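
As one illustration of those file options (a sketch, not part of the original answer), the input streams could be opened with FileOptions.SequentialScan, which hints to Windows that each file will be read front to back and can improve read-ahead caching; the 64 KB buffer sizes here are just guesses to tune:

static void ConcatenateFilesWithOptions(string outputFile, params string[] inputFiles)
{
    using (Stream output = new FileStream(outputFile, FileMode.Create,
        FileAccess.Write, FileShare.None, 65536))
    {
        foreach (string inputFile in inputFiles)
        {
            // SequentialScan tells the OS cache we won't seek backwards.
            using (Stream input = new FileStream(inputFile, FileMode.Open,
                FileAccess.Read, FileShare.Read, 65536, FileOptions.SequentialScan))
            {
                input.CopyTo(output);
            }
        }
    }
}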

Jon Skeet
  • I guess your answer to the question is no? – Marcus Johansson Aug 24 '10 at 13:19
  • @Marcus: I guess so... although I wasn't sure that the OP would have been comfortable writing the stream versions above. – Jon Skeet Aug 24 '10 at 13:21
  • Thank you Jon for the help! :) I didn't know about "CopyTo". – Martin Aug 24 '10 at 13:22
  • Great indeed to hear about `CopyTo`, now I can delete my answer ;-) – Abel Aug 24 '10 at 13:27
  • the CopyStream method looks a lot like the implementation of CopyTo; is it on purpose? – dada686 Aug 24 '10 at 13:40
  • @dada686: It wasn't copied, but I'm not surprised if they're similar, given that they have exactly the same purpose and it's a pretty trivial bit of code. – Jon Skeet Aug 24 '10 at 13:44
  • Looking at the kernel level, it's likely that this isn't really the most efficient. You're spending quite a bit of time copying data in memory. Passing FILE_FLAG_NO_BUFFERING to the underlying CreateFile would prevent this. – MSalters Aug 24 '10 at 14:36
  • @MSalters: When you say "quite a bit of time" - isn't that likely to be massively dwarfed by the time spent doing the physical read? Using FileOptions.SequentialScan when creating the input streams may help, but I'd usually go for the simplest approach that worked until I found there to be an actual issue. – Jon Skeet Aug 24 '10 at 15:10
  • Actually, modern disks are becoming quite fast. This applies especially to RAID arrays and SSDs. Furthermore, it looks like you'd have not one but two memory copies (to and from the unaligned buffer). By skipping that, you're probably not going to see double-digit performance increases, but 1-10% faster is likely. – MSalters Aug 25 '10 at 08:56
2

Do it from the command line:

copy 1.txt+2.txt+3.txt combined.txt

or

copy *.txt combined.txt
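
As the comments below note, typing 125 names by hand is impractical, so the command string itself could be generated. A small C# sketch of that idea (not from the original answer; the directory path and output name are placeholders):

using System;
using System.IO;
using System.Linq;

class BuildCopyCommand
{
    static void Main()
    {
        // Quote each full path so names with spaces survive.
        var files = Directory.GetFiles(@"C:\data", "*.tsv")
                             .Select(f => "\"" + f + "\"")
                             .ToArray();
        string command = "copy /b " + string.Join("+", files) + " \"combined.tsv\"";
        Console.WriteLine(command); // paste into cmd.exe, or run via cmd /c
    }
}

Note that cmd.exe has a command-length limit (roughly 8,191 characters), so a list of 125 full paths may not fit; the wildcard form above avoids that entirely.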
Gabriel Magana
  • 1
    You do realize he said **125** files, right? That's going to be very long and tedious to type out. If you gave a C# program to generate the copy string, that might be a *partial* answer. – Aaronaught Aug 24 '10 at 13:21
  • 6
    Dude, then use the second option, with the file mask. Or do a dir command (ie, dir /b to get only filenames), capture the filenames to a file, and construct the command in a good text editor. There are _many_ ways to avoid typing 125 filenames. – Gabriel Magana Aug 24 '10 at 13:23
  • The point is, you didn't even come close to answering the question. You've made a ton of assumptions about the problem domain that you can't possibly know. It's fine to *ask* for more details about the domain but not to simply assume that the question author has chosen an incorrect way of resolving his problem. -1 for your possibly irrelevant solution and your argumentative tone, "dude." – Aaronaught Aug 24 '10 at 18:00
  • 1
    LOL, gotta love self-appointed mods. Chill out. You read too much into things (which is, coincidentally, what you accuse me of; talk about projecting yourself). The OP asked how to combine files, I gave an answer that works. It may fit the problem perfectly or it may not. OP knows if that's the case, _but you do not_. I'm not up for a pissing match though, so this is my last response to you. – Gabriel Magana Aug 24 '10 at 19:09
2

Do you mean by "merge" that you want to decide with some custom logic which lines go where? Or do you mean that you mainly want to concatenate the files into one big one?

In the case of the latter, it is possible that you don't need to do this programmatically at all; just generate one batch file with this (/b is for binary; remove it if not needed):

copy /b "file 1.tsv" + "file 2.tsv" "destination file.tsv"

Using C#, I'd take the following approach. Write a simple function that copies two streams:

void CopyStreamToStream(Stream dest, Stream src)
{
    // experiment with the best buffer size; 65536 is often very performant
    const int GOOD_BUFFER_SIZE = 65536;
    byte[] buffer = new byte[GOOD_BUFFER_SIZE];
    int bytesRead;

    // copy everything
    while ((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
    {
        dest.Write(buffer, 0, bytesRead);
    }
}

// then use as follows (do it in a loop; don't forget the using-blocks)
CopyStreamToStream(yourOutputStream, yourInputStream);
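
A sketch of that loop (not from the original answer; the paths are placeholders), with the using-blocks the comment asks for:

// Write the result outside the input folder so it isn't picked up as an input.
using (Stream output = File.OpenWrite(@"C:\combined.tsv"))
{
    foreach (string path in Directory.GetFiles(@"C:\data", "*.tsv"))
    {
        using (Stream input = File.OpenRead(path))
        {
            CopyStreamToStream(output, input); // note: destination first, source second
        }
    }
}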
Abel
  • @Aaronaught: I was halfway when I submitted, then I wrote the second part. But also, note the little hint in the second para: *"just generate one batch file"*. By generating, I mean: create automatically. But then I decided to add the C# code :) – Abel Aug 24 '10 at 13:25
0

Using a folder of 100 MB text files totalling ~12 GB, I found that a small time saving could be made over the accepted answer by using File.ReadAllBytes and then writing that out to the stream. Note that this reads each whole file into memory at once, so it trades a ~100 MB buffer per file for slightly faster copying.

[Test]
public void RaceFileMerges()
{
    var inputFilesPath = @"D:\InputFiles";
    var inputFiles = Directory.EnumerateFiles(inputFilesPath).ToArray();

    var sw = new Stopwatch();
    sw.Start();

    ConcatenateFilesUsingReadAllBytes(@"D:\ReadAllBytesResult", inputFiles);

    Console.WriteLine($"ReadAllBytes method in {sw.Elapsed}");

    sw.Reset();
    sw.Start();

    ConcatenateFiles(@"D:\CopyToResult", inputFiles);

    Console.WriteLine($"CopyTo method in {sw.Elapsed}");
}

private static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (var output = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            using (var input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

private static void ConcatenateFilesUsingReadAllBytes(string outputFile, params string[] inputFiles)
{
    using (var stream = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            var currentBytes = File.ReadAllBytes(inputFile);
            stream.Write(currentBytes, 0, currentBytes.Length);
        }
    }
}

ReadAllBytes method in 00:01:22.2753300

CopyTo method in 00:01:30.3122215

I repeated this a number of times with similar results.

Owen Pauling