
I've been researching and I found material on Parallel.For, but I can't figure out how to code it without some kind of error.

One error I keep getting is that multiple threads are trying to access the same file.

I currently have code that is sequential, but it takes a long time. My text file is 10 GB.

This is my sequential part; all my attempts to parallelize it have failed:

for (int i = 0; i <= 10; i++)
{
    Console.WriteLine("Parsing List: " + i);
    min_chunk += chunk;
    max_chunk += chunk;
    if (max_chunk >= lines)
    {
        max_chunk = lines - 1;
    }
    if (i == 0)
    {
        min_chunk = 0;
        max_chunk = chunk;
    }
    int diff = (int)(max_chunk - min_chunk);
    splitFile("sort.txt", min_chunk, max_chunk, i);
}

public static void splitFile(string path, int min, int max, int threadnum)
{
    string outFileName = String.Concat("list", threadnum, ".txt");
    System.IO.StreamWriter outfile = new System.IO.StreamWriter(outFileName);

    for (int currline = min; currline < max; currline++)
    {
        string line = File.ReadLines("sort.txt").Skip(currline).Take(1).First();
        outfile.WriteLine(line);
    }

    outfile.Close();
}
    I suggest you post your existing code and we can help with where you are going wrong. – Adrian Sanguineti Dec 04 '14 at 03:52
  • I think you would find [this answer](http://stackoverflow.com/a/3527293/2589202) interesting. Especially the part about not getting any speed improvement because you are IO bound. – crthompson Dec 04 '14 at 04:04
  • You should do some metrics on the code. Most likely you will find that you are reading the file at the ~max transfer rate of your disk drive. – Aron Dec 04 '14 at 05:16

2 Answers


Here are a few links to questions related to yours that have already been answered:

zash707

You don't need multiple threads to speed this up.

What you really want is to read the file once, and split it as you go. I don't really understand what you're doing with the min_chunk and max_chunk, but what I would suggest is that you define a chunk size, say it's 10,000 lines. You can then do this:

int maxLines = 10000;   // lines per output file
int numLines = 0;
int fileNumber = 0;
var writer = File.CreateText("list" + fileNumber + ".txt");
foreach (var line in File.ReadLines("sort.txt"))
{
    writer.WriteLine(line);
    ++numLines;
    if (numLines == maxLines)
    {
        // this chunk is full: close it and start the next file
        writer.Close();
        numLines = 0;
        ++fileNumber;
        writer = File.CreateText("list" + fileNumber + ".txt");
    }
}
writer.Close();

Using multiple threads to split a single text file usually won't speed things up, for two reasons.

First, if you have 10 threads going, the first thread reads the first N lines and outputs them. At the same time, the second thread is reading the same file, skipping the first N lines and writing the next N lines. With 10 threads, you have the file open 10 times and all but one of the threads is spending most of its time reading and skipping over stuff that it doesn't care about.

Second, the disk can only do one thing at a time. When multiple threads are trying to write to a single disk, it's slower than having a single thread do it. When a single thread is writing to the disk, it can just write ... and write ... and write. When multiple threads are trying to write, one writes, then the disk has to reposition the read/write head before it can write for the next thread, and so on. Those repositionings (called head seeks) take a lot of time: on the order of 5 to 10 milliseconds, which is an eternity in CPU time. The result is that your threads spend most of their time waiting for other threads to write.

Update

If for some reason you're dead set on doing this with multiple threads, you need to fix this loop in your splitFile method:

for (int currline = min; currline < max; currline++)
{
    string line = File.ReadLines("sort.txt").Skip(currline).Take(1).First();
    outfile.WriteLine(line);
}

Given that loop, with min = 100 and max = 200, it's going to read the file 100 times. The first time it will skip 100 lines and output 1. Then it'll close the file, and the next time through the loop it'll skip 101 lines and output 1. That's going to take quite a long time.

You can change that to:

foreach (var line in File.ReadLines("sort.txt").Skip(min).Take(max-min))
{
    outfile.WriteLine(line);
}

In fact, if you really wanted to get fancy, you could write:

File.WriteAllLines(outFileName, File.ReadLines("sort.txt").Skip(min).Take(max-min));
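
That one line would replace nearly the whole body of your splitFile method. A sketch, keeping your signature:

public static void splitFile(string path, int min, int max, int threadnum)
{
    string outFileName = String.Concat("list", threadnum, ".txt");
    // one sequential pass over the input instead of one re-read per output line
    File.WriteAllLines(outFileName, File.ReadLines(path).Skip(min).Take(max - min));
}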

But you still have the problem of multiple threads trying to access the same input file. If File.ReadLines is opening the file in exclusive mode, then you have two choices:

  1. use a lock to prevent multiple threads from trying to access the file concurrently (a sketch follows after this list)
  2. open the file with permissive sharing
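
Option 1 would look something like this sketch, where _fileLock is a single object shared by all the threads, and min, max, and outFileName come from the surrounding splitFile method:

// shared by every thread that touches sort.txt
private static readonly object _fileLock = new object();

// inside splitFile:
List<string> myLines;
lock (_fileLock)
{
    // only one thread reads the input at a time; copy the chunk out
    // so the lock is released before the (slow) writing starts
    myLines = File.ReadLines("sort.txt").Skip(min).Take(max - min).ToList();
}
File.WriteAllLines(outFileName, myLines);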

An example of option 2:

using (var fs = new FileStream("sort.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    using (var reader = new StreamReader(fs))
    {
        int i = 0;
        while (!reader.EndOfStream && i < max)
        {
            string line = reader.ReadLine();
            if (i >= min)   // skip the lines before this thread's chunk
                outfile.WriteLine(line);
            ++i;
        }
    }
}

That will do what you're asking. It's not a very smart way to do things, though, because you have 10 threads all reading the same file concurrently, and most of them are spending their time skipping over lines. You're doing a whole lot of unnecessary work. The simple single-threaded version that I presented first is going to outperform this, especially if the output files are all on the same drive.
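
If you're still determined to try it, a minimal Parallel.For driver might look like the sketch below. It assumes the fixed splitFile above, and that chunk (lines per output file, an int) and lines (the total line count, a long) have already been computed the way your sequential loop computes them:

using System.Threading.Tasks;

int numChunks = (int)((lines + chunk - 1) / chunk);   // round up
Parallel.For(0, numChunks, i =>
{
    int min = i * chunk;
    int max = (int)Math.Min((long)min + chunk, lines);
    splitFile("sort.txt", min, max, i);   // each i writes its own list file
});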

Jim Mischel
  • Thanks for the response Jim, I am not trying to write to the same file. I am trying to read out of one general file (sort.txt) and break that up into n different files. I tried this using OpenMP in C++ and it worked well, but it has been a difficult experience in C# and I can't stop till I get it right. – Bisoye Olaleye Dec 04 '14 at 09:09
  • @BisoyeOlaleye: The code I presented *does* break a single file up into n different files. My point about the disk doesn't have anything to do with multiple threads writing to the same file, but with multiple threads writing to separate files on the same disk drive. – Jim Mischel Dec 04 '14 at 15:45