
I want separate async threads of the SplitFile method to run so that the task finishes faster, but the code below is not working. When I debug, it reaches the line RecCnt = File.ReadAllLines(SourceFile).Length - 1; and then comes out. Please help.

public delegate void SplitFile_Delegate(FileInfo file);

static void Main(string[] args)
{ 
     DirectoryInfo d = new DirectoryInfo(@"D:\test\Perf testing Splitter"); //Assuming Test is your Folder
     FileInfo[] Files = d.GetFiles("*.txt"); //Getting Text files

     foreach (FileInfo file in Files)
     {
         SplitFile_Delegate LocalDelegate = new SplitFile_Delegate(SplitFile);

         IAsyncResult R = LocalDelegate.BeginInvoke(file, null, null); //invoking the method
         LocalDelegate.EndInvoke(R);

      }
}

private static void SplitFile(FileInfo file)
{
     try
     {
         String fname;
         //int FileLength;
         int RecCnt;
         int fileCount;
         fname = file.Name;
         String SourceFile = @"D:\test\Perf testing Splitter\" + file.Name;


         RecCnt = File.ReadAllLines(SourceFile).Length - 1;
         fileCount = RecCnt / 10000;

         FileStream fs = new FileStream(SourceFile, FileMode.Open);
         using (StreamReader sr = new StreamReader(fs))
         {
             while (!sr.EndOfStream)
             {
                 String dataLine = sr.ReadLine();
                 for (int x = 0; x < (fileCount + 1); x++)
                 {
                      String Filename = @"D:\test\Perf testing Splitter\Destination Files\" + fname + "_" + x + "by" + (fileCount + 1) + ".txt"; //test0by4
                      using (StreamWriter Writer = file.AppendText(Filename))
                      {
                          for (int y = 0; y < 10000; y++)
                          {
                              Writer.WriteLine(dataLine);
                              dataLine = sr.ReadLine();
                          }
                          Writer.Close();
                      }

                  }
              }
          }
     }
     catch (Exception ex)
     {
          Console.WriteLine(ex.Message);
     }
}
  • What do you mean by "it comes out"? Exception? Blocking? – flq May 19 '15 at 05:34
  • It executes successfully but does not give any result (does not split the file). After that line it goes back to Main, then comes back to SplitFile with the next filename and does the same again. – Mandar May 19 '15 at 06:01
  • As a side remark: `async != faster` in general – Random Dev May 19 '15 at 06:01
  • Btw: please make sure to use reasonable formatting - that was a horrible mess (no guarantee I fixed everything right). – Random Dev May 19 '15 at 06:07
  • Just the `EndInvoke` right after the `BeginInvoke` is the worst async anti-pattern I ever saw - what are you trying to do here? – Random Dev May 19 '15 at 06:08
  • Is your code actually fully saturating the CPU core? I'm pretty sure you're going to be limited by the I/O, not the CPU load (unless you have a rather large SSD RAID array :)). Throwing more threads on the problem isn't going to help - the best you can do is eliminate the CPU time between the `ReadAllLines` and the actual work. But you'll save a lot more if you just *don't read the file twice*. You already have all the lines in memory, so why do you throw them away and process the file again with such a wasteful read pattern? `ReadLine` returns `null` on the last line, so there is no need to check EOF :) – Luaan May 19 '15 at 06:24
  • @Luaan: I use ReadAllLines because I want to start the next file as soon as the record count reaches 10K in each file. I run the outer for loop over X to generate each new file, and I think it needs to know its upper limit to run that loop. Please help with a better approach. – Mandar May 19 '15 at 06:46

2 Answers


Your code doesn't really need any multi-threading. It doesn't really even need asynchronous processing all that much - you're saturating the I/O most likely, and unless you've got multiple drives as the data sources, you're not going to improve that by adding parallelism.

On the other hand, your code is reading each file twice. For no reason, wasting memory, time and even CPU. Instead, just do this:

FileStream fs = new FileStream(SourceFile, FileMode.Open);
using (StreamReader sr = new StreamReader(fs))
{
    string line;
    string fileName = null; 
    StreamWriter outputFile = null;
    int lineCounter = 0;
    int outputFileIndex = 0;

    while ((line = sr.ReadLine()) != null)
    {
        if (fileName == null || lineCounter >= 10000)
        {
            lineCounter = 0;
            outputFileIndex++;
            fileName = @"D:\Output\" + fname + "_" + outputFileIndex + ".txt";

            if (outputFile != null) outputFile.Dispose();
            outputFile = File.AppendText(fileName);
        }

        outputFile.WriteLine(line);
        lineCounter++;
    }

    // Close the last output file so its buffered contents are flushed to disk.
    if (outputFile != null) outputFile.Dispose();
}

If you really need the filename in the XOutOfY format, you can just rename the files afterwards - that's a lot cheaper than reading the source file twice, line after line. Or, if you don't mind keeping the whole file in memory at once, just use the array you got from ReadAllLines and iterate over that, rather than doing the reading all over again.

To make this even easier, you can also use foreach (var line in File.ReadLines(fileName)).
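For illustration, a minimal sketch of that variant, reusing fname and SourceFile from the method above and the same 10,000-line chunk size (the output path is just an example):

// Sketch only: File.ReadLines enumerates the source file lazily, one line at a time.
StreamWriter outputFile = null;
int lineCounter = 0;
int outputFileIndex = 0;

try
{
    foreach (var line in File.ReadLines(SourceFile))
    {
        if (outputFile == null || lineCounter >= 10000)
        {
            // Start the next chunk, closing the previous one first.
            lineCounter = 0;
            outputFileIndex++;
            if (outputFile != null) outputFile.Dispose();
            outputFile = File.AppendText(@"D:\Output\" + fname + "_" + outputFileIndex + ".txt");
        }

        outputFile.WriteLine(line);
        lineCounter++;
    }
}
finally
{
    // Close the last chunk.
    if (outputFile != null) outputFile.Dispose();
}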

If you really want to make this asynchronous, the way to handle that is by using asynchronous I/O, not just by spinning up new threads. So you can use await with StreamReader.ReadLineAsync etc.
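A rough sketch of how that could look (this assumes .NET 4.5 or later; the method name, chunk size and output naming are illustrative, not part of the original code):

// Sketch only - requires using System.IO; and using System.Threading.Tasks;
private static async Task SplitFileAsync(string sourceFile, string outputDir)
{
    using (var reader = new StreamReader(sourceFile))
    {
        StreamWriter outputFile = null;
        int lineCounter = 0;
        int outputFileIndex = 0;

        try
        {
            string line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                if (outputFile == null || lineCounter >= 10000)
                {
                    // Close the previous chunk and start a new one.
                    lineCounter = 0;
                    outputFileIndex++;
                    if (outputFile != null) outputFile.Dispose();
                    outputFile = File.AppendText(Path.Combine(outputDir, "part_" + outputFileIndex + ".txt"));
                }

                await outputFile.WriteLineAsync(line);
                lineCounter++;
            }
        }
        finally
        {
            // Close the last chunk.
            if (outputFile != null) outputFile.Dispose();
        }
    }
}

The caller can then await this per file (or use Task.WhenAll over several files) without blocking a thread on the I/O - but the disk remains the bottleneck.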

  • Error: Can not use unassigned variable for linecounter and outputfile. – Mandar May 19 '15 at 08:39
  • Thanks Luaan. It worked. What exactly are these lines doing: `if (outputFile != null) outputFile.Dispose(); outputFile = File.AppendText(fileName);` – Mandar May 19 '15 at 08:47
  • @Mandar The first one makes sure the old (and finished) output file is closed properly. The second then creates (well, "appends to" - you might want to rethink that) a new file for the next batch of lines. The rest of the loop then only has to care about writing a single line to the file, without having to care what file it actually is. – Luaan May 19 '15 at 08:50
  • Thanks Luaan. Now I want to place this code inside a separate method (or say, thread) and run multiple instances of that method simultaneously so that the execution time is reduced. Is there any way to achieve this? – Mandar May 20 '15 at 05:08
  • @Mandar Why would you think running multiple instances of this method simultaneously would reduce the execution time? You're still bottlenecked by the I/O - reading a hundred files at once is always going to be slower than reading them one after another (as long as they're on the same physical hard drive). Avoid multi-threading unless it's absolutely necessary - multi-threading is *hard*. And even then, you'd want to use asynchronous I/O rather than outright multi-threading - there's no point in wasting threads while waiting for I/O work to complete. – Luaan May 20 '15 at 07:08
  • Thanks for this clarification Luaan. I was somehow under impression that parallelism would improve the performance in this scenario. – Mandar May 20 '15 at 09:03
  • @Mandar Parallelism can only ever help you if you actually have multiple resources to work with. If you've got two separate physical disk drives, reading from both in parallel is going to work nicely. If you don't, you're still limited by the capabilities of the one drive. This is similar on the CPU - if you've got one CPU core, adding more threads isn't going to make anything faster. If you've got 4 cores, there's little point in using 20 CPU-bound threads for your work - they will not work faster than 4 similar threads. This is somewhat oversimplified, of course. – Luaan May 20 '15 at 09:10

You are not required to call EndInvoke; really, all EndInvoke does here is wait on the return value for you. Since SplitFile returns void, my guess is there's an optimization that kicks in because you don't need to wait on anything, and it simply skips the wait. For more details: C# Asynchronous call without EndInvoke?

That being said, your usage of Begin/EndInvoke will likely not be faster than serial code (and will likely be marginally slower): because you call EndInvoke immediately after BeginInvoke, each iteration of the loop still runs serially. All that has changed is that you're using a delegate where one doesn't look necessary.
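If you do want to stay with the delegate approach (note that delegate BeginInvoke only works on the classic .NET Framework), a sketch of the usual pattern, reusing Files and SplitFile_Delegate from the question, is to start all the calls first and only then wait on them:

// Sketch only - requires using System; using System.Collections.Generic; using System.IO;
var del = new SplitFile_Delegate(SplitFile);
var pending = new List<IAsyncResult>();

foreach (FileInfo file in Files)
    pending.Add(del.BeginInvoke(file, null, null)); // each call is queued onto a thread-pool thread

foreach (IAsyncResult r in pending)
    del.EndInvoke(r); // blocks until that particular call has finished

That removes the per-file blocking, but on a single disk it is still unlikely to beat the plain serial loop.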

It's possible that what you meant to use was Parallel.ForEach (MSDN: https://msdn.microsoft.com/en-us/library/dd992001(v=vs.110).aspx) which will potentially run iterations in parallel.
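As a sketch (reusing the SplitFile method and folder from the question), the loop in Main could look roughly like this:

// Sketch only - requires using System.IO; and using System.Threading.Tasks;
DirectoryInfo d = new DirectoryInfo(@"D:\test\Perf testing Splitter");
FileInfo[] files = d.GetFiles("*.txt");

// The runtime picks the degree of parallelism; each iteration runs on a thread-pool thread.
Parallel.ForEach(files, file => SplitFile(file));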

Edit: As someone else has mentioned, having multiple threads engage in file operations will likely not improve performance, as your file ops are probably disk bound. The main benefit you would get from asynchronous file reads/writes would probably be unblocking the main thread for UI updates. You will need to specify what you mean by "performance" if you want a better answer.

  • What I ideally want is for all the file-splitting threads to run independently and in parallel. – Mandar May 19 '15 at 06:51