You don't need multiple threads to speed this up.
What you really want is to read the file once and split it as you go. I don't really understand what you're doing with min_chunk and max_chunk, but what I would suggest is that you define a chunk size, say 10,000 lines. You can then do this:
int maxLines = 10000;
int numLines = 0;
int fileNumber = 0;
var writer = File.CreateText("list" + fileNumber + ".txt");
foreach (var line in File.ReadLines("sort.txt"))
{
    writer.WriteLine(line);
    ++numLines;
    if (numLines == maxLines)
    {
        // This chunk is full: close it and start the next output file.
        writer.Close();
        numLines = 0;
        ++fileNumber;
        writer = File.CreateText("list" + fileNumber + ".txt");
    }
}
writer.Close();
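If you're on .NET 6 or later, Enumerable.Chunk does the same single-pass split even more compactly. A sketch, assuming it's acceptable to hold one 10,000-line chunk in memory at a time:

// requires: using System.IO; using System.Linq;
int fileNumber = 0;
foreach (var chunk in File.ReadLines("sort.txt").Chunk(10000))
{
    // Chunk yields arrays of 10,000 lines (the last one may be shorter).
    File.WriteAllLines("list" + fileNumber++ + ".txt", chunk);
}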
Using multiple threads to split a single text file usually won't speed things up, for two reasons.
First, if you have 10 threads going, the first thread reads the first N lines and outputs them. At the same time, the second thread is reading the same file, skipping the first N lines and writing the next N lines. With 10 threads, you have the file open 10 times and all but one of the threads is spending most of its time reading and skipping over stuff that it doesn't care about.
Also, the disk can only do one thing at a time. When multiple threads are trying to write to a single disk, it's slower than having a single thread do it. When a single thread is writing to the disk, it can just write ... and write ... and write. When multiple threads are trying to write, one writes, then the disk has to reposition the read/write head before it can write for the next thread, and so on. Those repositionings (called head seeks) take a lot of time, on the order of 5 to 10 milliseconds, which is an eternity in CPU time. What happens is that your threads spend most of their time waiting for other threads to write.
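To make that concrete, the multi-threaded approach looks roughly like this (my sketch; the 10-way split, the chunk size, and the splitFile(min, max) signature are assumptions based on your question):

// requires: using System.Threading.Tasks;
const int chunkSize = 10000;
var tasks = new Task[10];
for (int i = 0; i < 10; i++)
{
    int min = i * chunkSize;    // every thread but the first reads and
    int max = min + chunkSize;  // discards all the lines before its min
    tasks[i] = Task.Run(() => splitFile(min, max));
}
Task.WaitAll(tasks);  // 10 readers of sort.txt, 10 writers fighting over one disk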
Update
If for some reason you're dead set on doing this with multiple threads, you need to fix this loop in your splitFile method:
for (int currline = min; currline < max; currline++)
{
    string line = File.ReadLines("sort.txt").Skip(currline).Take(1).First();
    outfile.WriteLine(line);
}
Given that loop, min = 100, and max = 200, it's going to read the file 100 times. The first time through it skips 100 lines and outputs one. Then it closes the file, and the next time through the loop it skips 101 lines and outputs one. That's going to take quite a long time.
You can change that to:
foreach (var line in File.ReadLines("sort.txt").Skip(min).Take(max - min))
{
    outfile.WriteLine(line);
}
In fact if you really wanted to get fancy, you could write:
File.WriteAllLines(outFileName, File.ReadLines("sort.txt").Skip(min).Take(max-min));
But you still have the problem of multiple threads trying to access the same input file. If File.ReadLines is opening the file in exclusive mode, then you have two choices:
- use a lock to prevent multiple threads from trying to access the file concurrently (sketched just below)
- open the file with permissive sharing
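A minimal sketch of option 1, using a shared lock object (fileLock and SplitChunk are names I'm making up here; they're not from your code):

// requires: using System.Collections.Generic; using System.IO; using System.Linq;
private static readonly object fileLock = new object();

void SplitChunk(int min, int max, string outFileName)
{
    List<string> lines;
    lock (fileLock)
    {
        // Only one thread at a time touches sort.txt. ToList() materializes
        // the chunk before the lock is released.
        lines = File.ReadLines("sort.txt").Skip(min).Take(max - min).ToList();
    }
    // Writing the output file doesn't need the lock.
    File.WriteAllLines(outFileName, lines);
}

Note that the lock serializes all the reads, so at that point the extra threads aren't buying you much.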
An example of option 2:
using (var fs = new FileStream("sort.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
using (var reader = new StreamReader(fs))
{
    int i = 0;
    while (!reader.EndOfStream && i < max)
    {
        string line = reader.ReadLine();
        if (i >= min)  // >= rather than >, or you'll drop the first line of the chunk
            outfile.WriteLine(line);
        ++i;
    }
}
That will do what you're asking. It's not a very smart way to do things, though, because you have 10 threads all reading the same file concurrently, and most of them are spending their time skipping over lines. You're doing a whole lot of unnecessary work. The simple single-threaded version that I presented first is going to outperform this, especially if the output files are all on the same drive.