
Let's say that I have a big list of strings called "lines" that holds all the lines from a text file (usually a large number, around 100k - 1mil lines):

List<string> lines = File.ReadAllLines("Lines.txt").ToList();

And my problem is that I need to split the file (or the list) based on a chunk size the user inputs. So let's say that we have 10k lines in Lines.txt and the user inputs chunks of 4400 lines:

File1 = 4400 Lines 
File2 = 4400 Lines
File3 = 1200 Lines

I tried using something like this that my colleague recommended, but I don't understand it and it does not work.

public static class ListExtensions
{
    public static List<List<T>> ChunkBy<T>(this List<T> source, int chunkSize) 
    {
        return source
            .Select((x, i) => new { Index = i, Value = x })
            .GroupBy(x => x.Index / chunkSize)
            .Select(x => x.Select(v => v.Value).ToList())
            .ToList();
    }
}

I would appreciate any recommendations or help on how I could solve this.

NAGA
  • hmm... you should probably be handling this off a stream. – Brett Caswell Nov 03 '19 at 01:27
  • Why doesn't it work? It seems fine to me (without trying it, just from inspecting it). – Theodor Zoulias Nov 03 '19 at 01:39
  • Also are you sure that you want to load all lines in memory? There are ways to load one line at a time, or one chunk at a time, depending on what you want to do with these lines/chunks ([Create batches in linq](https://stackoverflow.com/questions/13731796/create-batches-in-linq)). – Theodor Zoulias Nov 03 '19 at 01:45
  • @TheodorZoulias that's a good idea you brought up. It's very true that I shouldn't be storing this in RAM, because sometimes the files can be massive. – NAGA Nov 03 '19 at 16:54
  • Take a look at the [`File.ReadLines`](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readlines) method. It reads the file in small chunks (4096 bytes if I remember correctly) while you are enumerating the enumerable. From the documentation: *When you use `ReadLines`, you can start enumerating the collection of strings before the whole collection is returned.* – Theodor Zoulias Nov 03 '19 at 17:34
  • @TheodorZoulias Yeah, definitely a performance increase using `File.ReadLines` over `File.ReadAllLines`. I made sure to highlight that in my answer below. – RoadRunner Nov 03 '19 at 22:54
  • @RoadRunner yeap. For even better performance you can keep reading the file in one thread (or asynchronous workflow) while processing the lines in another thread; the producer-consumer pattern, in other words. It can be easily implemented with the [TPL Dataflow](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) library, which also has the [`BatchBlock`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.dataflow.batchblock-1) block to do the batching (see the sketch below). – Theodor Zoulias Nov 04 '19 at 00:53
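
A minimal sketch of the `BatchBlock` approach from the last comment might look like this (the file naming, the 4400 chunk size, and the single-threaded writer are illustrative assumptions, not part of the comment):

using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

class ChunkWithDataflow
{
    static async Task Main()
    {
        // BatchBlock groups posted items into arrays of the given size
        var batcher = new BatchBlock<string>(4400);

        // Consumer: write each batch to its own file (File1.txt, File2.txt, ...)
        var fileIndex = 0;
        var writer = new ActionBlock<string[]>(batch =>
            File.WriteAllLines($"File{++fileIndex}.txt", batch));

        batcher.LinkTo(writer, new DataflowLinkOptions { PropagateCompletion = true });

        // Producer: stream the lines in without loading the whole file into memory
        foreach (var line in File.ReadLines("Lines.txt"))
            batcher.Post(line);

        batcher.Complete();      // flushes the final, partial batch
        await writer.Completion; // wait until all files are written
    }
}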

4 Answers


How about this:

var numOfChunks = lines.Count / chunkSize; // initial number of full chunks
if (lines.Count % chunkSize > 0) { numOfChunks++; } // add one chunk for the remainder, if there is one
for (var i = 0; i < numOfChunks; i++)
{
    var chunk = lines.Skip(i * chunkSize).Take(chunkSize);
    // Do something with chunk, like writing to file
}
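
For example, the loop body might write each chunk to its own numbered file (the File{n}.txt naming follows the question's example; requires System.Linq and System.IO):

for (var i = 0; i < numOfChunks; i++)
{
    var chunk = lines.Skip(i * chunkSize).Take(chunkSize);
    File.WriteAllLines($"File{i + 1}.txt", chunk); // File1.txt, File2.txt, ...
}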
Nimrod Dolev

LINQ's Skip and Take may help here.

public static List<List<T>> ChunkBy<T>(this List<T> source, int chunkSize) 
{
    var pages = new List<List<T>>();
    var page = 0;
    while (true)
    {
        // Take the next chunk; an empty result means the whole list has been consumed
        var chunk = source.Skip(page++ * chunkSize).Take(chunkSize).ToList();
        if (!chunk.Any()) break;
        pages.Add(chunk);
    }
    return pages;
}
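
Usage with the question's lines list might look like this (4400 is the user's example chunk size):

var pages = lines.ChunkBy(4400); // List<List<string>>; the last page holds the 1200-line remainder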
tymtam

Here is a version with no LINQ.

public static List<List<T>> ChunkBy<T>(List<T> source, int chunkSize) 
{
    var pages = new List<List<T>>();
    var page = new List<T>();
    var i = 0;
    foreach (var s in source)
    {
        // Start a new page every chunkSize items
        if ((i++ % chunkSize) == 0)
        {
            page = new List<T>();
            pages.Add(page);
        }
        page.Add(s);
    }

    return pages;
}
tymtam

Here's one method you can use for this task. You need to pass in the sourceFile you are chunking, the destDirectory to write the files to, and the chunk size.

private static void ChunkFile(string sourceFile, string destDirectory, int chunkSize)
{
    // Lazily enumerate the lines (the whole file is not loaded into memory at once)
    var lines = File.ReadLines(sourceFile);

    // Calculate the number of chunks, rounding up so the remainder gets its own chunk
    var numberOfChunks = (int)Math.Ceiling((double)lines.Count() / chunkSize);

    // Go through each chunk and write to file
    for (var i = 0; i < numberOfChunks; i++)
    {
        // Skip the chunks we've already handled, and take the next chunk
        var chunk = lines.Skip(i * chunkSize).Take(chunkSize);

        // Write chunk to destination path
        File.WriteAllLines(Path.Combine(destDirectory, $"File{i + 1}.txt"), chunk);
    }
}

This should generate your chunked files in the format File1.txt, File2.txt, File3.txt, etc.

You will also need to implement error handling, such as checking that sourceFile exists.

Additionally, I suggest taking a look at these two LINQ methods from System.Linq:

  • [`Enumerable.Skip`](https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable.skip)
  • [`Enumerable.Take`](https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable.take)

It might also be helpful to look at these IO methods from System.IO to read/write files:

  • [`File.ReadLines`](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readlines)
  • [`File.WriteAllLines`](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.writealllines)

Note: We use `File.ReadLines` instead of `File.ReadAllLines` to avoid reading the whole file into memory at once, which matters when reading large files, where loading everything up front can hurt performance. You can read more about this at What is the difference between File.ReadLines() and File.ReadAllLines()?.
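
One caveat: each Skip/Take pass above re-enumerates the file from the beginning, and lines.Count() adds one more full pass. For very large files, a fully streaming variant may be preferable. Here is a rough sketch (a hypothetical helper, not part of the answer above) that buffers only one chunk at a time:

private static void ChunkFileStreaming(string sourceFile, string destDirectory, int chunkSize)
{
    var buffer = new List<string>(chunkSize);
    var fileIndex = 0;

    // Single pass over the file; only one chunk is ever held in memory
    foreach (var line in File.ReadLines(sourceFile))
    {
        buffer.Add(line);
        if (buffer.Count == chunkSize)
        {
            File.WriteAllLines(Path.Combine(destDirectory, $"File{++fileIndex}.txt"), buffer);
            buffer.Clear();
        }
    }

    // Write the remainder (e.g. the 1200-line chunk from the question's example)
    if (buffer.Count > 0)
        File.WriteAllLines(Path.Combine(destDirectory, $"File{++fileIndex}.txt"), buffer);
}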

RoadRunner