22

Code:

static void MultipleFilesToSingleFile(string dirPath, string filePattern, string destFile)
{
    string[] fileAry = Directory.GetFiles(dirPath, filePattern);

    Console.WriteLine("Total File Count : " + fileAry.Length);

    using (TextWriter tw = new StreamWriter(destFile, true))
    {
        foreach (string filePath in fileAry)
        {
            using (TextReader tr = new StreamReader(filePath))
            {
                tw.WriteLine(tr.ReadToEnd());
                tr.Close();
                tr.Dispose();
            }
            Console.WriteLine("File Processed : " + filePath);
        }

        tw.Close();
        tw.Dispose();
    }
}

I need to optimize this as it's extremely slow: it takes 3 minutes for 45 XML files averaging 40–50 MB each.

Please note: 45 files averaging 45 MB each is just one example; it can be n files of size m, where n may be in the thousands and m may average 128 KB. In short, it can vary.

Could you please provide any views on optimization?

Pratik
  • 45 files of an average 45MB each is a total of just over 2GB. How long do you expect that to take? Disk I/O will account for a large chunk of the time it's taking. – Ken White Jan 25 '13 at 15:32
  • Do you need to wait for this method to finish? If not, try async. – cuongle Jan 25 '13 at 15:32
  • Calling `Dispose` is superfluous as the objects you're disposing are already in a using block (which will take care of Dispose for you). – Tim Jan 25 '13 at 15:32
  • You're loading each file into memory. Such big strings will go on the large object heap; why don't you read smaller chunks of data (reusing the buffer)? Close/Dispose are useless because of the using statement. A raw Stream is enough because you do not handle/change any encoding. After doing all of this... you'll see performance won't change too much because most of the time is probably spent in I/O. If the output file isn't on the same disk as the inputs, you may even try to make reading and writing asynchronous (pre-read the next file/chunk while writing). – Adriano Repetti Jan 25 '13 at 15:39
  • @KenWhite 45 files of an average 45 MB is just one example; it can be n files of size m, where n is in the thousands and m can average 128 KB. In short, it can vary. – Pratik Jan 25 '13 at 16:01
  • @CuongLe Nope, no waiting at all. I just need to do the mentioned activity in minimum time; that's the optimization I'm looking for. – Pratik Jan 25 '13 at 16:02
  • You missed my point. :-) Again, disk I/O is going to be a very large part of the time taken, and the larger `n` is the longer it's going to take for just the disk i/o. On top of that, you have the actual overhead of object creation, memory allocation, GC, and so forth. – Ken White Jan 25 '13 at 16:03
  • @KenWhite Yes, true. So is anything like a parallel read possible, ensuring the writing order remains unchanged? Will that be quicker? – Pratik Jan 25 '13 at 16:06
  • @Tim Thanks, Dispose is considered a best practice so I had kept it; I will definitely remove it now. Thanks for the clarifications! – Pratik Jan 25 '13 at 16:08
  • @Pratik one last note: if you may have 1000+ files, consider using Directory.EnumerateFiles instead of Directory.GetFiles (sketched after these comments). For the same reason, I suggest you check the file size to decide which copy method is better (one single big read or multiple small chunks). Finally, do not use the _helper_ function AppendAllText: it opens and closes the file for each write. – Adriano Repetti Jan 25 '13 at 16:17
  • @Adriano Is it advisable to use pointers & unsafe code to speed up the process? – Pratik Jan 25 '13 at 16:29
  • @Pratik no, most of the time is spent on (slow) disk I/O; you won't gain anything by using unsafe code. It's better to just refactor your code so it doesn't waste memory/CPU and improve the algorithm (OK, even multithreading for I/O is somewhat empirical). Well, you may consider rewriting your code to use ReadFileScatter and WriteFileGather, but frankly speaking I don't know how much of a performance boost you'll get (compared to the effort to use them, at least until very high speed SSDs are common enough). – Adriano Repetti Jan 25 '13 at 16:44
  • Possible duplicate of [Efficient way to combine multiple text files](http://stackoverflow.com/questions/6311358/efficient-way-to-combine-multiple-text-files) – Liam Mar 30 '17 at 14:28
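For illustration, a minimal sketch of the Directory.EnumerateFiles suggestion from the comments (dirPath and filePattern are the question's parameters; the loop body is left as a placeholder):

// EnumerateFiles streams paths lazily, so thousands of file names are not
// materialized into a single array before processing starts.
foreach (string filePath in Directory.EnumerateFiles(dirPath, filePattern))
{
    // copy the contents of filePath to the destination stream here
}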

6 Answers

53

General answer

Why not just use the Stream.CopyTo(Stream destination) method?

private static void CombineMultipleFilesIntoSingleFile(string inputDirectoryPath, string inputFileNamePattern, string outputFilePath)
{
    string[] inputFilePaths = Directory.GetFiles(inputDirectoryPath, inputFileNamePattern);
    Console.WriteLine("Number of files: {0}.", inputFilePaths.Length);
    using (var outputStream = File.Create(outputFilePath))
    {
        foreach (var inputFilePath in inputFilePaths)
        {
            using (var inputStream = File.OpenRead(inputFilePath))
            {
                // Buffer size can be passed as the second argument.
                inputStream.CopyTo(outputStream);
            }
            Console.WriteLine("The file {0} has been processed.", inputFilePath);
        }
    }
}

Buffer size adjustment

Please note that the mentioned method is overloaded.

There are two method overloads:

  1. CopyTo(Stream destination).
  2. CopyTo(Stream destination, int bufferSize).

The second method overload provides the buffer size adjustment through the bufferSize parameter.
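For instance, the loop body of the method above could pass an explicit buffer size through the second overload (the 1 MB value here is only illustrative; the best size depends on your disks):

using (var inputStream = File.OpenRead(inputFilePath))
{
    // The second argument is the copy buffer size in bytes (1 MB here, purely illustrative).
    inputStream.CopyTo(outputStream, 1024 * 1024);
}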

  • How can we write distinct values into the file? Suppose textFile1.text has rows like "test, test, test" and "abc, pqr, xyz", and textFile2.text has rows like "test, test, test" and "pqr, xyz, abcde"; then textFile3.text should have rows like "test, test, test", "abc, pqr, xyz", "pqr, xyz, abcde". – Rocky Jul 06 '16 at 08:02
  • @Rocky, could you please create the appropriate question and provide the link to the question? – Sergey Vyacheslavovich Brunov Jul 06 '16 at 15:54
  • @SergeyBrunov how can I separate this 'single file' to get the files back? – mrid Aug 30 '18 at 10:27
  • @mrid, feel free to create a separate question here, on Stack Overflow. Long story short, you need to store the metadata somewhere. The metadata may be represented as a table of contents: the offset of each combined file within the resulting (single) one (see the sketch after these comments). – Sergey Vyacheslavovich Brunov Sep 01 '18 at 19:45
  • It's not working with video (webm extension) files, and it's also not giving any error. – Aamir Nakhwa May 02 '19 at 13:08
  • @AamirNakhwa, this is because here by «combination» we mean the plain (straightforward) concatenation of the [input] files, i.e. without taking a particular file format (file format specifics) into account. – Sergey Vyacheslavovich Brunov May 03 '20 at 08:42
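To illustrate the table-of-contents idea from the comments above, a hypothetical sketch (the names and tuple layout are assumptions, not part of the answer; assumes using System.IO and System.Collections.Generic): while concatenating, record each input file's offset and length so the combined file can be split again later.

// Hypothetical sketch: record where each input file lands in the combined output,
// so the single file can be split back into its parts later.
var tableOfContents = new List<(string FileName, long Offset, long Length)>();

using (var outputStream = File.Create(outputFilePath))
{
    foreach (var inputFilePath in inputFilePaths)
    {
        long offset = outputStream.Position;
        using (var inputStream = File.OpenRead(inputFilePath))
        {
            inputStream.CopyTo(outputStream);
        }
        tableOfContents.Add((Path.GetFileName(inputFilePath), offset, outputStream.Position - offset));
    }
}
// Persist tableOfContents next to the combined file; each original file can then be
// recovered by seeking to its Offset and reading Length bytes.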
3

One option is to utilize the copy command and let it do what it does well.

Something like:

static void MultipleFilesToSingleFile(string dirPath, string filePattern, string destFile)
{
    var cmd = new ProcessStartInfo("cmd.exe", 
        String.Format("/c copy {0} {1}", filePattern, destFile));
    cmd.WorkingDirectory = dirPath;
    cmd.UseShellExecute = false;
    Process.Start(cmd);
}
Eren Ersönmez
  • Will this work? My requirement is: I have a directory with 100 files, 50 of them *.xml, and I need to combine all of them into one file. Will the above work for me? – Pratik Jan 25 '13 at 16:12
  • Oops, then I guess this is NOT what I'm looking for! By the way, if it copies all contents of the files to a single file, it may work for me. Is that the case? – Pratik Jan 25 '13 at 16:14
  • @KenWhite Thanks for your explanations. I did search for an MS-DOS way to do this activity but couldn't find one, so I chose to write .NET code. But do let me know if it's possible with MS-DOS commands (IFF it provides better performance than the .NET approach). Thank you. – Pratik Jan 25 '13 at 16:28
  • Just add the `/b` switch to force `copy` to treat them as **binary** files (then it'll **append** them); a sketch follows these comments. If you need a command line solution this is good (it's not the _best_ solution from a performance point of view, but the effort to make this good is pretty high). – Adriano Repetti Jan 25 '13 at 16:55
  • @Ken White This does append each file into a single file. Tested. – Eren Ersönmez Jan 25 '13 at 18:18
  • @Eren: I stand corrected. This must be a change in `cmd.exe` I hadn't caught. I'll remove my comments - luckily I didn't downvote. :-) Thanks for the correction; I always like learning things, even if I'm proven wrong in the process. (And +1, while I'm at it.) – Ken White Jan 25 '13 at 18:34
  • Launch a command line utility to combine the content of files using C#? Are you kidding? – Sergey Vyacheslavovich Brunov Jan 25 '13 at 20:49
  • This is a LAME approach. I doubt it would fare better than the OP's code; it involves launching a new process, which may have overhead, and there's no decent error handling option (the exit code is not a good option). Besides that, it looks archaic. Lame. – Sten Petrov Jan 25 '13 at 22:03
  • I would never launch a process with unsanitized inputs as parameters. – Eric J. May 02 '17 at 22:13
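For reference, a sketch of the `/b` variant suggested in the comments (the quoting of the arguments is added here as an assumption; inputs should still be validated before being handed to cmd.exe):

static void MultipleFilesToSingleFile(string dirPath, string filePattern, string destFile)
{
    // /c runs the command and exits; /b makes copy treat the files as binary and append them.
    var cmd = new ProcessStartInfo("cmd.exe",
        String.Format("/c copy /b \"{0}\" \"{1}\"", filePattern, destFile));
    cmd.WorkingDirectory = dirPath;
    cmd.UseShellExecute = false;
    Process.Start(cmd);
}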
2

I would use a BlockingCollection to read so you can read and write concurrently.
Clearly you should write to a separate physical disk to avoid hardware contention. This code will preserve order.
Reading is going to be faster than writing, so there is no need for a parallel read.
Again, since reading is faster, limit the size of the collection so the read does not get further ahead of the write than it needs to.
A simple task that reads the next file in parallel while writing the current one has the problem of different file sizes: writing a small file is faster than reading a big one.

I use this pattern to read and parse text on T1 and then insert into SQL on T2.

public void WriteFiles()
{
    using (BlockingCollection<string> bc = new BlockingCollection<string>(10))
    {
        // play with 10 if you have several small files then a big file
        // write can get ahead of read if not enough are queued

        TextWriter tw = new StreamWriter(@"c:\temp\alltext.text", true);
        // clearly you want to write to a different physical disk
        // ideally write to solid state even if you move the files to regular disk when done
        // Spin up a Task to populate the BlockingCollection
        using (Task t1 = Task.Factory.StartNew(() =>
        {
            string dir = @"c:\temp\";
            string fileText;      
            int minSize = 100000; // play with this
            StringBuilder sb = new StringBuilder(minSize);
            string[] fileAry = Directory.GetFiles(dir, @"*.txt");
            foreach (string fi in fileAry)
            {
                Debug.WriteLine("Add " + fi);
                fileText = File.ReadAllText(fi);
                //bc.Add(fi);  for testing just add filepath
                if (fileText.Length > minSize)
                {
                    if (sb.Length > 0)
                    { 
                       bc.Add(sb.ToString());
                       sb.Clear();
                    }
                    bc.Add(fileText);  // could be really big so don't hit sb
                }
                else
                {
                    sb.Append(fileText);
                    if (sb.Length > minSize)
                    {
                        bc.Add(sb.ToString());
                        sb.Clear();
                    }
                }
            }
            if (sb.Length > 0)
            {
                bc.Add(sb.ToString());
                sb.Clear();
            }
            bc.CompleteAdding();
        }))
        {

            // Spin up a Task to consume the BlockingCollection
            using (Task t2 = Task.Factory.StartNew(() =>
            {
                string text;
                try
                {
                    while (true)
                    {
                        text = bc.Take();
                        Debug.WriteLine("Take " + text);
                        tw.WriteLine(text);                  
                    }
                }
                catch (InvalidOperationException)
                {
                    // An InvalidOperationException means that Take() was called on a completed collection
                    Debug.WriteLine("That's All!");
                    tw.Close();
                    tw.Dispose();
                }
            }))

                Task.WaitAll(t1, t2);
        }
    }
}

BlockingCollection Class

paparazzo
  • If input and output come from the same disk, then each read will have to wait for (or will be slowed down by) the writing... – Adriano Repetti Jan 25 '13 at 16:21
  • Too much code for too little of a task. Multithreading won't help split the disk R/W head in two. – Sten Petrov Jan 25 '13 at 22:05
  • @StenPetrov What part of "Clearly should write to a separate physical disk to avoid hardware contention" was not clear? – paparazzo Jan 25 '13 at 22:25
  • @Blam so on top of what you wrote here we'll have to write another piece that writes to a single disk? – Sten Petrov Jan 25 '13 at 22:31
  • @StenPetrov The code does not fail on a single disk. With read and write caching it will probably even get some parallelism. I would not optimize differently for a single disk. So you would do it differently; that is clear from your answer. – paparazzo Jan 25 '13 at 23:22
2

I tried the solution posted by sergey-brunov for merging a 2 GB file. The system took around 2 GB of RAM for this work. I have made some changes for further optimization, and it now takes 350 MB of RAM to merge a 2 GB file.

private static void CombineMultipleFilesIntoSingleFile(string inputDirectoryPath, string inputFileNamePattern, string outputFilePath)
{
    string[] inputFilePaths = Directory.GetFiles(inputDirectoryPath, inputFileNamePattern);
    Console.WriteLine("Number of files: {0}.", inputFilePaths.Length);
    foreach (var inputFilePath in inputFilePaths)
    {
        using (var outputStream = File.AppendText(outputFilePath))
        {
            outputStream.WriteLine(File.ReadAllText(inputFilePath));
            Console.WriteLine("The file {0} has been processed.", inputFilePath);
        }
    }
}
kashified
1

Several things you can do:

  • In my experience, the default buffer sizes can be increased with noticeable benefit up to about 120 KB; I suspect setting a large buffer on all streams will be the easiest and most noticeable performance booster:

    new System.IO.FileStream("File.txt", System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read, 150000);
    
  • Use the Stream class, not the StreamReader class.

  • Read contents into a large buffer and dump them into the output stream at once; this will speed up operations on small files (see the sketch after this list).
  • No need for the redundant Close/Dispose calls: the using statement takes care of that.
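Putting those points together, a minimal sketch (the method name and the 150000-byte buffer are illustrative; assumes using System.IO):

private static void ConcatenateWithLargeBuffers(string dirPath, string filePattern, string destFile)
{
    const int bufferSize = 150000; // illustrative; tune around the ~120 KB suggested above
    byte[] buffer = new byte[bufferSize];

    using (var output = new FileStream(destFile, FileMode.Create, FileAccess.Write, FileShare.None, bufferSize))
    {
        foreach (string filePath in Directory.EnumerateFiles(dirPath, filePattern))
        {
            using (var input = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize))
            {
                int bytesRead;
                while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
                {
                    output.Write(buffer, 0, bytesRead);
                }
            }
        }
    }
}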
Sten Petrov
0
    // Binary File Copy
    public static void mergeFiles(string strFileIn1, string strFileIn2, string strFileOut, out string strError)
    {
        strError = String.Empty;
        try
        {
            using (FileStream streamIn1 = File.OpenRead(strFileIn1))
            using (FileStream streamIn2 = File.OpenRead(strFileIn2))
            using (FileStream writeStream = File.OpenWrite(strFileOut))
            {
                // create a buffer to hold the bytes; it could be larger
                byte[] buffer = new Byte[1024];
                int bytesRead;

                // while the read method returns bytes keep writing them to the output stream
                while ((bytesRead =
                        streamIn1.Read(buffer, 0, 1024)) > 0)
                {
                    writeStream.Write(buffer, 0, bytesRead);
                }
                while ((bytesRead =
                        streamIn2.Read(buffer, 0, 1024)) > 0)
                {
                    writeStream.Write(buffer, 0, bytesRead);
                }
            }
        }
        catch (Exception ex)
        {
            strError = ex.Message;
        }
    }
Miguelito