My code is:

    int lineNumber = File.ReadLines(path).Count();

but it takes a long time (about 20 seconds) for files of about 1 GB.

Does anyone know a better way to solve this problem?

Update 6:

I have tested your solutions on a file of about 870 MB:

Method 1 (my code): 13 seconds

Method 2 (from MarcinJuraszek & Locke, identical): 12 seconds

Method 3 (from Richard Deeming): 19 seconds

Method 4 (from user2942249): 13 seconds

Method 5 (from Locke): 13 seconds, the same for lineBuffer = {4096, 8192, 16384, 32768}

Method 6 (from Locke, edit 2): 9 seconds with a 32 KB buffer, 10 seconds with a 64 KB buffer

As I said in my comments, there is a native-code application that opens this file on my PC in 5 seconds, so this is not about HDD speed.

Compiling the MSIL to native code made no obvious difference.

Conclusion: at this time, Locke's second method (method 6) is faster than the others, so I marked his post as the answer; but this post stays open if anyone finds a better idea.

I gave +1 to the dear friends who helped me solve the problem. Thanks for your help. Best regards, Smart Man
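For reproducibility, timings like the ones above can be taken with a Stopwatch harness along these lines. This is only a sketch: `CountLines` is a hypothetical wrapper around whichever method is under test, and the untimed warm-up run absorbs JIT compilation cost, as groverboy suggests in the comments below.

    // Sketch only: CountLines is a hypothetical wrapper for the method under test.
    Func<string, int> countLines = CountLines;
    countLines(path); // untimed warm-up run: absorbs JIT compilation cost

    var sw = new System.Diagnostics.Stopwatch();
    const int runs = 8;
    long totalMs = 0;
    for (int run = 0; run < runs; run++)
    {
        sw.Restart();
        countLines(path);
        sw.Stop();
        totalMs += sw.ElapsedMilliseconds;
    }
    Console.WriteLine("Average: {0} ms", totalMs / runs);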

  • Have you tried any other methods - there are plenty of them – Rob Nov 05 '13 at 17:14
  • You could just load the file one chunk at a time and add up the number of line break characters from each chunk. – Gabe Nov 05 '13 at 17:15
  • @Rob: no, I only know this method. Can you help me? Thanks. – Smart Man Nov 05 '13 at 17:15
  • Depending on the speed of your hard drives, it might actually take that long to read a 1 gig file, in which case you are I/O bound, and *no coding technique* will improve the speed. – Robert Harvey Nov 05 '13 at 17:15
  • @RobertHarvey, fragmentation may even play a role in this as well. – Mike Perrenoud Nov 05 '13 at 17:17
  • There's a good answer [here](http://stackoverflow.com/a/119572/1346943) that mentions an option that may be faster for large files – Sven Grosen Nov 05 '13 at 17:17
  • @Gabe: can you help me with sample code? Thanks. – Smart Man Nov 05 '13 at 17:17
  • @SmartMan see the answer I linked to, it has sample code in it – Sven Grosen Nov 05 '13 at 17:17
  • @RobertHarvey: I do not want to read the whole file, just to know its total number of lines. – Smart Man Nov 05 '13 at 17:18
  • The only way that I know of to do that is to read the entire file, unless you plan on maintaining a line index for the file as it is written to. – Robert Harvey Nov 05 '13 at 17:19
  • @ledbutter: thanks, I will look at it now. – Smart Man Nov 05 '13 at 17:19
  • @SmartMan, it's not like the line count is embedded in meta-data somewhere. You have to read the file to know the number of lines -- and to make matters worse -- it would be a different character you'd be looking for in different encodings. – Mike Perrenoud Nov 05 '13 at 17:19
  • 20 seconds to count the lines in a 1-gigabyte file is about right. One gigabyte in 20 seconds is about 50 megabytes per second, which is pretty typical reading speed for a consumer box. – Jim Mischel Nov 05 '13 at 17:26
  • @neoistheone: my friend, UltraEdit (http://www.ultraedit.com) loads and counts a 1 gig file in a few seconds, very fast, with minimal memory usage. I do not know how they do it, but there must be a faster method! – Smart Man Nov 05 '13 at 17:28
  • @RobertHarvey: The OP is already using the solution in the one marked duplicate, and finds it insufficient because that question is looking for "easiest" way to count the line numbers and OP wants "fastest". Those questions have very different answers. – Gabe Nov 05 '13 at 17:28
  • @RobertHarvey: my friend, why did you mark this post as a duplicate? This post is about the fastest method to do this, but http://stackoverflow.com/questions/119559/determine-the-number-of-lines-within-a-text-file is general. – Smart Man Nov 05 '13 at 17:31
  • UltraEdit may not be loading the entire file at once. A better test is to copy the file from one hard drive to another, or load the file into memory using code. There is extensive treatment of performance over at the duplicate question; are you sure that it hasn't already been adequately covered there? The answers there seem to suggest that you're *already* using the fastest possible method. Have you considered indexing the file, and maintaining that index as additional lines are added to it? Or even just maintaining a line count? – Robert Harvey Nov 05 '13 at 17:35
  • @RobertHarvey: yes, in http://stackoverflow.com/questions/119559/determine-the-number-of-lines-within-a-text-file there are 4 different methods, and I tested them, but none of them satisfied me! – Smart Man Nov 05 '13 at 17:40
  • I feel like I'm repeating myself, but *you're probably already doing it the fastest way.* http://www.youtube.com/watch?v=QpZ3dVpE_pY – Robert Harvey Nov 05 '13 at 17:41
  • @RobertHarvey: yes my friend, I am looking for the tested fastest way. Please mark this post as a normal question. Thanks. – Smart Man Nov 05 '13 at 17:43
  • Have you considered my suggestion to maintain a line index for the file? – Robert Harvey Nov 05 '13 at 17:43
  • @RobertHarvey: no, can you help me with sample code? Thanks. – Smart Man Nov 05 '13 at 17:46
  • Read each line from the file, and store its position in a `List`. Add a new position to the list when you add a line to the file. When you need the line count, just call `list.Count` (see the sketch after these comments). – Robert Harvey Nov 05 '13 at 17:47
  • @RobertHarvey: thanks my friend, if you provide your answer with sample code in C#, I can test it, and if it works I will mark it as the answer. I am not a pro like you. Thanks a lot for any help. – Smart Man Nov 05 '13 at 17:49
  • Fair warning: *It will still take 20 seconds the first time you read the file to build the index.* Maybe even a little longer. It will take me a day or two to get around to writing the code; if you want it sooner, edit that specific request into your question and I will reopen it. I can't guarantee, however, that your question won't get closed again as an `icanhazcodez` request. – Robert Harvey Nov 05 '13 at 17:51
  • @SmartMan: You really have to explain your purpose. Where does the file come from? Why do you need to know how many lines? What will you be doing with that information? – Gabe Nov 05 '13 at 18:04
  • @RobertHarvey: thanks my friend, I will wait for you and other friends for an answer. I think this is an important question for those who care about the performance of their applications. Again, thanks a lot my dear friend. – Smart Man Nov 05 '13 at 18:05
  • @SmartMan: Just make sure you do your homework. You can't really expect others to write your application for you. – Robert Harvey Nov 05 '13 at 18:07
  • @SmartMan, I added a second method with a buffer you should try adjusting for performance testing. I've found that on large files, increasing the buffer size can help. Obviously there are diminishing returns, however it could boost you to sub 10 seconds per Gb. – Parrish Husband Nov 05 '13 at 22:05
  • @SmartMan: How are you measuring the time? I don't have a multi-Gb file to test, but on a 5.5Mb file, @Locke's `FileStream` method is consistently faster than the `ReadLines(path).Count()` method. – Richard Deeming Nov 06 '13 at 11:33
  • @SmartMan did you take JIT compilation into account with your testing? To performance-test .NET code I would write a tester app that executes the code at least 11 times, recording the time taken for each execution. The first time includes JIT compilation time so this is always longer and therefore irrelevant. I'd average the remaining 10 times. Alternatively, [precompile](http://msdn.microsoft.com/en-us/library/ht8ecch6%28v=VS.90%29.aspx) the .NET code. UltraEdit may be native code that doesn't require JIT compilation. – groverboy Nov 06 '13 at 12:36
  • What about files with different text encodings? Does UltraEdit perform similarly with 1 Gb files encoded as eg. ASCII, CJK, UTF-8, UTF-16? While the code by @RichardDeeming is slower than the other answers, it's the only one that takes text encoding into account, which I reckon is essential. – groverboy Nov 06 '13 at 12:44
  • @groverboy: I am happy to see your interesting ideas, thanks a lot for your attention. I have tested 8 times but no difference occurred. UltraEdit is native code (C++). I have not tested different text encodings; I will test this and then update the performance reports. Thanks a lot. – Smart Man Nov 06 '13 at 12:52
  • @SmartMan when you say "8 times" do you mean run a program (EXE) 8 times or have a test program execute a test method 8 times? – groverboy Nov 06 '13 at 12:56
  • @groverboy: execute a test method 8 times. I am going to compile the MSIL to native code to see the effect on performance. – Smart Man Nov 06 '13 at 13:00
  • Btw about a year ago I tested the processing time of large files using range of buffer sizes. With a 64 Kb buffer I got the shortest time, consistently. Smaller or larger buffers yielded longer times. Tested on a Dell XPS (WinXP) and a Dell vostro (Win7). The optimum size probably depends on a range of variables, not only disk block size, and no doubt will change as disk capacities increase. – groverboy Nov 06 '13 at 13:13
  • @SmartMan I added a multithreaded variant to my filestream method. I see solid improvements on my machine using a 145Mb file. My HDD is nothing special (a WD Blue 250Gb) with a Xeon E3-1225 processor. – Parrish Husband Nov 06 '13 at 14:15
  • @groverboy, I tested your ideal 64Kb buffer size using my multithreading method and I can confirm you are correct. – Parrish Husband Nov 06 '13 at 14:49
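Following up on Robert Harvey's line-index suggestion in the comments above, here is a minimal sketch (an assumption of how it could look, not code from any answer): pay for one full read up front, keep the index current as lines are appended, and every later count is instant.

    var lineIndex = new List<long>(); // byte offset where each line starts
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        long offset = 0;
        lineIndex.Add(0); // the first line starts at offset 0
        int b;
        while ((b = fs.ReadByte()) != -1) // buffer these reads in real code
        {
            offset++;
            if (b == '\n')
                lineIndex.Add(offset); // the next line starts right after the LF
        }
    }
    // The writer must also call lineIndex.Add(...) whenever it appends a line.
    int lineCount = lineIndex.Count; // O(1) on every later query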

4 Answers


Here are a few ways this can be accomplished quickly:

StreamReader:

int lineCount = 0;
using (var sr = new StreamReader(path))
{
    // ReadLine() returns null only at end-of-file, so blank lines are counted too
    while (sr.ReadLine() != null)
        lineCount++;
}

FileStream:

var lineBuffer = new byte[65536]; // 64 KB
int lineCount = 0;
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
       FileShare.Read, lineBuffer.Length))
{
    int readBuffer = 0;
    while ((readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0)
    {
        for (int i = 0; i < readBuffer; i++)
        {
            if (lineBuffer[i] == 0xD) // carriage return ('\r'); assumes CRLF or CR line endings
                lineCount++;
        }
    }
}

Multithreading:

Arguably the number of threads shouldn't affect the read speed, but real-world benchmarking can sometimes prove otherwise. Try different buffer sizes and see if you get any gains at all with your setup. *This method contains a race condition. Use with caution.*

int lineCount = 0;
var tasks = new Task[Environment.ProcessorCount]; // 1 per core
var fileLock = new ReaderWriterLockSlim();
int bufferSize = 65536; // 64 KB

using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
        FileShare.Read, bufferSize, FileOptions.RandomAccess))
{
    for (int i = 0; i < tasks.Length; i++)
    {
        tasks[i] = Task.Factory.StartNew(() =>
            {
                int readBuffer = 0;
                var lineBuffer = new byte[bufferSize];

                while ((fileLock.TryEnterReadLock(10) && 
                       (readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0))
                {
                    fileLock.ExitReadLock();
                    for (int n = 0; n < readBuffer; n++)
                        if (lineBuffer[n] == 0xD)
                            Interlocked.Increment(ref lineCount);
                }
            });
    }
    Task.WaitAll(tasks);
}
Parrish Husband
  • Just tested it, and I don't see a difference up or down in execution time, however that could change with a larger file. It's interesting though because I thought I needed FileShare.Read for threads to work on the same file together. – Parrish Husband Nov 06 '13 at 15:02
  • FileStream isn't designed to be accessed from multiple threads simultaneously. Your multithreaded code won't necessarily *work*. – Servy Nov 06 '13 at 18:07
  • @Servy, while I agree with you, it seems to be working on the small-scale tests so far. If you have some light you could shed on why the performance gains are happening I'm happy to listen. – Parrish Husband Nov 06 '13 at 18:18
  • @Locke The code has race conditions. That doesn't mean it never works, it means it will sometimes work and it sometimes won't work; you can never know anytime you run the program what you'll get. The performance characteristics are *irrelevant* if it doesn't work. A program that's 10x faster is still useless if it doesn't provide the correct answer. – Servy Nov 06 '13 at 18:20
  • @Servy, let me know if the update I just made satisfies race condition avoidance. – Parrish Husband Nov 06 '13 at 21:24
  • @Locke You have not; you still have multiple threads reading from a filestream without any synchronization at all. – Servy Nov 06 '13 at 21:28
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/40671/discussion-between-locke-and-servy) – Parrish Husband Nov 06 '13 at 21:32
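For completeness, here is a race-free variant of the multithreaded idea discussed above (a sketch, not the code from the answer): give each task its own `FileStream` over a disjoint byte range, so no handle is shared and no locking is needed. Counting single bytes means chunk boundaries can't split a match.

    long length = new FileInfo(path).Length;
    int workers = Environment.ProcessorCount;
    long chunk = (length + workers - 1) / workers;
    int lineCount = 0;

    var tasks = Enumerable.Range(0, workers).Select(w => Task.Run(() =>
    {
        long start = w * chunk;
        long end = Math.Min(start + chunk, length);
        var buffer = new byte[65536];
        int local = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
               FileShare.Read, buffer.Length))
        {
            fs.Position = start; // each task reads only its own range
            long remaining = end - start;
            int read;
            while (remaining > 0 &&
                   (read = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining))) > 0)
            {
                for (int i = 0; i < read; i++)
                    if (buffer[i] == 0xA) // count line feeds
                        local++;
                remaining -= read;
            }
        }
        Interlocked.Add(ref lineCount, local);
    })).ToArray();

    Task.WaitAll(tasks);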

Assuming that building a string to represent each line is what's taking the time, something like this might help:

public static int CountLines1(string path)
{
   int lineCount = 0;
   bool skipNextLineBreak = false;
   bool startedLine = false;
   var buffer = new char[16384];
   int readChars;

   using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length))
   using (var reader = new StreamReader(stream, Encoding.UTF8, false, buffer.Length, false))
   {
      while ((readChars = reader.Read(buffer, 0, buffer.Length)) > 0)
      {
         for (int i = 0; i < readChars; i++)
         {
            switch (buffer[i])
            {
               case '\n':
               {
                  if (skipNextLineBreak)
                  {
                     skipNextLineBreak = false;
                  }
                  else
                  {
                     lineCount++;
                     startedLine = false;
                  }
                  break;
               }
               case '\r':
               {
                  lineCount++;
                  skipNextLineBreak = true;
                  startedLine = false;
                  break;
               }
               default:
               {
                  skipNextLineBreak = false;
                  startedLine = true;
                  break;
               }
            }
         }
      }
   }

   return startedLine ? lineCount + 1 : lineCount;
}

Edit 2:
It's true what they say about "assume"! The overhead of calling .Read() for each character outweighs the savings from not creating a string for each line. Even updating the code to read a block of characters at a time is still slower than the original method.

Richard Deeming
  • `reader.Peek()` is going to be a slowdown too. If you can do it without peeking it would be faster. Perhaps putting the read at the end of the method instead of the start (and not calling it in the case of `13` because it already did the read) – Scott Chamberlain Nov 05 '13 at 18:16
  • @ScottChamberlain: `StreamReader` buffers the data as it reads it (at least in .NET 4.5), so `Peek()` followed by `Read()` shouldn't make any significant difference. – Richard Deeming Nov 05 '13 at 18:18
  • Also note that he doesn't want to count blank lines; this doesn't account for that. – Servy Nov 05 '13 at 18:53
  • @SmartMan: I'm surprised this is so much slower. I guess it's the trade-off between not creating the string for each line and calling the `Read` method for every character. – Richard Deeming Nov 06 '13 at 11:30

It is hardware dependent; one question is what the best buffer size is, perhaps something equal to the disk sector size or greater. After experimenting myself, I've found it's usually best to let the system determine that. If speed really is a concern, you can drop down to the Win32 API ReadFile/CreateFile, specifying various flags and parameters such as async I/O, no buffering, sequential read, etc., which may or may not help performance. You'll have to profile and see what works best on your system. In .NET you may be able to pin the buffer for better performance, though pinning memory in a GC environment has ramifications of its own if you keep it pinned too long. A sketch of the sequential-read hint follows the code below.

    const int bufsize = 4096;
    int lineCount = 0;
    bool pendingLf = false; // set when a buffer ends in CR; swallow a leading LF next time
    Byte[] buffer = new Byte[bufsize];
    using (System.IO.FileStream fs = new System.IO.FileStream(@"C:\data\log\20111018.txt", FileMode.Open, FileAccess.Read, FileShare.None, bufsize))
    {
        int totalBytesRead = 0;
        int bytesRead;
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            int i = 0;
            if (pendingLf)
            {
                if (buffer[0] == 10)
                    i++; // the LF of a CRLF pair split across two buffers
                pendingLf = false;
            }
            while (i < bytesRead)
            {
                switch (buffer[i])
                {
                    case 10: // LF
                        lineCount++;
                        i++;
                        break;
                    case 13: // CR: count it, then skip a following LF so CRLF counts once
                        lineCount++;
                        if (i + 1 < bytesRead)
                        {
                            i += (buffer[i + 1] == 10) ? 2 : 1;
                        }
                        else
                        {
                            pendingLf = true; // CR is the last byte of this buffer
                            i++;
                        }
                        break;
                    default:
                        i++;
                        break;
                }
            }
            totalBytesRead += bytesRead;
        }
        if ((totalBytesRead > 0) && (lineCount == 0))
            lineCount++;
    }
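As a sketch of the sequential-read hint mentioned above, .NET exposes it directly as `FileOptions.SequentialScan`, which maps to the Win32 FILE_FLAG_SEQUENTIAL_SCAN flag. Whether it helps is system dependent; profile it. The body is the same counting loop as above, only the extra option changes:

    // The extra flag tells the OS cache manager the file will be read front to
    // back, which can improve read-ahead behavior on large files.
    using (System.IO.FileStream fs = new System.IO.FileStream(@"C:\data\log\20111018.txt",
           FileMode.Open, FileAccess.Read, FileShare.None, bufsize, FileOptions.SequentialScan))
    {
        // ... same line-counting loop as above ...
    }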
user2942249

As your tests showed, changes in code aren't going to have a significant effect on the speed. The bottleneck is your disk reading the data, not the C# code processing it.

If you want to speed up the execution of this task, buy a faster/better hard drive: either one with a higher RPM, or even a solid-state drive. Alternatively you could consider using RAID0, which could potentially improve your disk read speeds.

Another option would be to have multiple hard drives, and to break up the file so that each drive stores one portion; you can then parallelize the work with one task handling the file on each drive, as sketched below. (Note that parallelizing the work when you only have one disk won't help anything, and is more likely to actually hurt.)
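A minimal sketch of that split-file idea, assuming the two halves have already been placed on separate physical drives (the part paths are hypothetical):

    // Each part lives on its own drive, so the reads genuinely overlap.
    var parts = new[] { @"C:\data\bigfile.part1", @"D:\data\bigfile.part2" };
    int total = parts.AsParallel()
                     .Sum(part => File.ReadLines(part).Count());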

Servy
  • Thanks, but as I said, there is an application, UltraEdit, that opens this file on my PC in 5 seconds; therefore this is not about HDD speed. – Smart Man Nov 05 '13 at 20:28
  • @SmartMan That doesn't mean it's actually loaded the entire contents of the file. It can rely on paging and deferred execution to not load the entire file's contents into memory. – Servy Nov 05 '13 at 20:31
  • Sure, it uses a minimum of memory. So how does it open and count the total number of lines in just 5 seconds? (My main question is only about counting lines.) – Smart Man Nov 05 '13 at 20:34
  • Hmm, I had a response going into UltraEdit's file loading technique, but the line count total being present so quickly is interesting. – Parrish Husband Nov 05 '13 at 20:36
  • Also, how did you measure the performance of this other application? If your timing mechanism is just you counting, being off by several seconds isn't impossible. It could also be caching information about the line numbers if you've loaded the file before. – Servy Nov 05 '13 at 20:38
  • @Servy: I measured with a stopwatch; the difference is obvious. And about caching information: I have tested 5 different files, and all of them opened as fast as they opened the first time. – Smart Man Nov 05 '13 at 21:32