-3

I want to loop over all the lines of a very large file (10 GB, for example) using foreach.

I am currently using File.ReadLines like that:

var lines = File.ReadLines(fileName);
foreach (var line in lines) {
  // Process line
}

But this gets very slow once the file is larger than about 2 MB, and the loop takes a very long time to run.

How can I loop over the lines of very large files efficiently?

Any help would be appreciated.

Thanks!

Mario
  • 1
    Possible duplicate of [What's the fastest way to read a text file line-by-line?](https://stackoverflow.com/questions/8037070/whats-the-fastest-way-to-read-a-text-file-line-by-line) – Owen Pauling Aug 15 '18 at 09:58

3 Answers

1

The way you do it is the best way available, given that:

  • you don't want to read your whole file into RAM at once
  • your line processing is independent of previous lines

Sorry, reading stuff from a hard disk is just slow.

Improvements will likely come from other sources:

  • store your file on a faster device (SSD?)
  • get more RAM and read your file into memory to at least speed up processing (see the sketch below)
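
If that second option is viable (the whole file fits comfortably in RAM), a minimal sketch using File.ReadAllLines might look like this, with fileName being the same placeholder as in the question:

var allLines = File.ReadAllLines(fileName); // one big read, everything held in memory
foreach (var line in allLines)
{
    // Process line - the loop itself is now CPU/RAM bound, with no disk wait per line
}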
Selman Genç
nvoigt
0

First of all, do you need to read the whole file or only a section of it?

If you only need to read a section of the file:

const int chunkSize = 1024; // read the file by chunks of 1KB
using (var file = File.OpenRead("yourfile"))
{
    int bytesRead;
    var buffer = new byte[chunkSize];
    while ((bytesRead = file.Read(buffer, 0 /* start offset */, buffer.Length)) > 0)
    {
        // TODO: Process bytesRead number of bytes from the buffer
        // not the entire buffer as the size of the buffer is 1KB
        // whereas the actual number of bytes that are read are 
        // stored in the bytesRead integer.
    }
}

If you need to load the whole file into memory:

Use the chunked read above repeatedly instead of loading everything into memory in one go, since that way you stay in control of what is happening and can stop the process at any time.

Or you can use MemoryMappedFile https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx?f=255&MSPPError=-2147217396

Memory-mapped files give the program a view of the file as if it were already in memory, while the data is actually read from disk only the first time each part is accessed (a line-by-line variant is sketched after the example below).

// Requires: using System.IO; and using System.IO.MemoryMappedFiles;
long offset = 0x10000000; // 256 megabytes
long length = 0x20000000; // 512 megabytes

// Create the memory-mapped file.
using (var mmf = MemoryMappedFile.CreateFromFile(@"c:\ExtremelyLargeImage.data", FileMode.Open, "ImgA"))
{
    // Create a random access view, from the 256th megabyte (the offset)
    // to the 768th megabyte (the offset plus length).
    using (var accessor = mmf.CreateViewAccessor(offset, length))
    {
        // Your processing here
    }
}
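
If you go the memory-mapped route but still want the line-by-line loop from the question, one option (a rough sketch, with a placeholder path) is to wrap a stream view of the mapping in a StreamReader:

// Requires: using System.IO; and using System.IO.MemoryMappedFiles;
// Pages are only pulled in from disk as they are touched, so mapping a 10 GB
// file is fine in a 64-bit process even if it does not fit in RAM.
using (var mmf = MemoryMappedFile.CreateFromFile(@"c:\yourfile.txt", FileMode.Open))
using (var stream = mmf.CreateViewStream())     // sequential view over the whole mapping
using (var reader = new StreamReader(stream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Process line
    }
}

Whether this is actually faster than File.ReadLines depends on the disk and the access pattern, so it is worth measuring before committing to it.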
Vibeeshan Mahadeva
  • 2
    Why would you think OP only needs a section of the file? – DavidG Aug 15 '18 at 09:50
  • I bet OP is reading and processing line-by-line:\ – vasily.sib Aug 15 '18 at 09:53
  • 2
    Your edit ignores the fact the `File.ReadLines` does all this streaming for you. – DavidG Aug 15 '18 at 09:56
  • I still believe that this method is better even for loading the whole file to memory, since you have control over the process and can show messages appropriately if needed – Vibeeshan Mahadeva Aug 15 '18 at 09:56
  • 2
    I don't see how this is applicable to the question. Sure you could do this... and write a lot of code on top of this to even reach what the OP sees as given... and it's still not a speed improvement. – nvoigt Aug 15 '18 at 09:58
-5

The looping will always be slow because of the sheer number of items you have to loop through. I'm pretty sure it's not the looping itself but the actual work you are doing on each of those lines that slows it down. A 10 GB file can easily contain hundreds of millions of lines, and anything but the most trivial per-line task will take a lot of time.

You could always try making the job multithreaded so that different threads work on different lines. That way at least you have more cores working on the problem.

For example, set up several loops that each start at a different line and step forward by the number of threads, or let the framework partition the work for you, as in the sketch below.
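
A minimal sketch of that idea, assuming each line can be processed independently of the others (Parallel.ForEach does the partitioning across cores, and fileName is the same placeholder as in the question):

// Requires: using System.IO; and using System.Threading.Tasks;
Parallel.ForEach(File.ReadLines(fileName), line =>
{
    // Process line - this delegate may run on several threads at once,
    // so it must not rely on line order or on unsynchronised shared state.
});

Note that the file is still read sequentially from disk, so this only helps when the per-line work, rather than the I/O, is the bottleneck.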

Also, I'm not 100% sure, but I think you could get a big speed increase by splitting the whole thing into an array of strings on newlines and then working through those, since everything is then already in memory.

string lines = "your huge text";      // the entire file contents as one string
string[] words = lines.Split('\n');   // one array entry per line
foreach (string singleLine in words)
{
    // Process singleLine
}

** Added based on comments ** There are massive downsides, and it will take a huge amount of memory (at least as much as the original file), but it gets around the problem of a slow hard drive: all the data is read straight into the machine's RAM, which is far, far faster to work through than reading from the hard drive in small chunks.

There is also the issue of a limit of about 2 billion lines, since that is the maximum number of entries an array can hold.

Christopher Vickers
  • 1
    So... copy 10GB of data manually into the code, resulting in a 10GB in-memory string literal? That sounds fun... – David Aug 15 '18 at 09:56
  • 1
    Are you guessing? Because I don't think you have tested this at all. Not only will it probably be slower, it will also waste around 20 GB or more of RAM. – nvoigt Aug 15 '18 at 09:57
  • 1
    "A file with 10GB of lines would literally have trillions of lines..." -> Paste to an array. So now we have a 10gb string literal (how will the compiler handle that (Im curious actually)) and now we have an array with 10gb worth of entries (I'm also curious how that will be handled). – Adrian Aug 15 '18 at 10:00
  • The comments on here: "Are you guessing? Because I don't think you have tested this at all". Of course I haven't tested this. I'm not being paid. These are just my thoughts. The OP can play around and see what's fastest – Christopher Vickers Aug 15 '18 at 10:07
  • I agree that reading from RAM is "far far faster" than from HDD, but your solution needs to read from the slow HDD into fast RAM at the speed of the slow HDD. Isn't this the same as if I read the file from HDD line-by-line? – vasily.sib Aug 15 '18 at 10:08
  • But the file is read as a steady stream instead of small chunks, which on a spinning platter HD will be 1000x faster than if a single line is read each time. Either way the stuff needs to be read into the system. At least my way it's done as a single fast chunk. A 10GB file can be read into memory in 20 seconds on a semi decent single platter hard drive. – Christopher Vickers Aug 15 '18 at 10:13