I've got a problem where I have to process a large batch of large JSONL files (read, deserialize, do some transforms and database lookups, etc., then write the transformed results) in a .NET Core console app.
I've already gotten better throughput by batching the output onto a separate thread, and I was trying to improve the processing side by adding some parallelism, but the overhead ended up being self-defeating.
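For context, the output side hands batches of transformed rows to a single writer thread over a bounded queue. Roughly like this simplified sketch (the BlockingCollection, _outputPath, and the batch handling are illustrative stand-ins, not my exact code):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Bounded queue of batches so the processing side can't get arbitrarily far ahead of the disk.
var batches = new BlockingCollection<List<string>>(boundedCapacity: 8);

var writerTask = Task.Run(() =>
{
    using (var writer = new StreamWriter(_outputPath))
    {
        // Blocks until a batch arrives; the loop ends once CompleteAdding() is called.
        foreach (var batch in batches.GetConsumingEnumerable())
        {
            foreach (var row in batch)
                writer.WriteLine(row);
        }
    }
});

// The processing loop collects transformed rows into a List<string> and calls
// batches.Add(currentBatch) every few thousand rows; once the input is exhausted:
// batches.CompleteAdding();
// writerTask.Wait();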
On the reading/processing side, I had been doing:
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    for (;;)
    {
        var l = reader.ReadLine();
        if (l == null)
            break;

        // Deserialize
        // Do some database lookups
        // Do some transforms
        // Pass result to output thread
    }
}
Some diagnostic timings showed me that the ReadLine() call was taking more time than the deserialization or the db lookups (gathered roughly as sketched after the list below). To put some numbers on it, a large file spends about:
- 11 seconds spent on ReadLine
- 7.8 seconds spent on deserialization
- 10 seconds spent on db lookups
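Those numbers come from simple Stopwatch instrumentation around each stage of the loop, roughly like this (simplified; the timer names are just for illustration):

using System;
using System.Diagnostics;
using System.IO;

var readTime = new Stopwatch();
var deserializeTime = new Stopwatch();
var lookupTime = new Stopwatch();

using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    for (;;)
    {
        readTime.Start();
        var l = reader.ReadLine();
        readTime.Stop();
        if (l == null)
            break;

        deserializeTime.Start();
        // Deserialize
        deserializeTime.Stop();

        lookupTime.Start();
        // Do some database lookups and transforms
        lookupTime.Stop();

        // Pass result to output thread
    }
}

Console.WriteLine($"ReadLine: {readTime.Elapsed}, deserialize: {deserializeTime.Elapsed}, lookups: {lookupTime.Elapsed}");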
I wanted to overlap that 11 seconds of file I/O with the other work, so I tried:
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    var nextLine = reader.ReadLineAsync();
    for (;;)
    {
        // Block on the line requested on the previous iteration...
        var l = nextLine.Result;
        if (l == null)
            break;

        // ...and immediately kick off the read for the next one.
        nextLine = reader.ReadLineAsync();

        // Deserialize
        // Do some database lookups
        // Do some transforms
        // Pass result to output thread
    }
}
The idea was to get the next read started while I did the transform work. Only that ended up taking a lot longer than the plain synchronous version (roughly twice as long).
I've got a requirement for predictability in the overall result (i.e. the same set of files has to be processed in name order and the output rows have to come out in the same, predictable order), so I can't just throw a file per thread and let them fight it out.
I was just trying to introduce enough parallelism to smooth out the throughput over a large set of inputs, and I was surprised at how counterproductive the above turned out to be.
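For reference, the kind of overlap I was hoping for would look something like this untested sketch (the BlockingCollection and its bounded capacity are assumptions on my part, not code I've run): the reads happen on a dedicated thread and the lines are consumed in order, so the file I/O overlaps the transforms without per-line async overhead and without giving up ordering.

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// A reader thread fills a bounded queue of raw lines while this thread
// deserializes/transforms them; order is preserved because there is one
// producer and one consumer.
var lines = new BlockingCollection<string>(boundedCapacity: 1000);

var readerTask = Task.Run(() =>
{
    try
    {
        using (var stream = new FileStream(_filePath, FileMode.Open))
        using (var reader = new StreamReader(stream))
        {
            string l;
            while ((l = reader.ReadLine()) != null)
                lines.Add(l); // blocks if the consumer falls behind
        }
    }
    finally
    {
        lines.CompleteAdding(); // lets the consuming loop below finish
    }
});

foreach (var l in lines.GetConsumingEnumerable())
{
    // Deserialize
    // Do some database lookups
    // Do some transforms
    // Pass result to output thread
}

readerTask.Wait();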
Am I missing something here?