
My question arises from this question; I had posted an answer to that question here.

This is the code:

var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Minimum = lines.First(); // seed with the first line (ReadLines returns IEnumerable<string>, which has no indexer)
var Maximum = "";

foreach (string line in lines)
{    
    if (Maximum.Length < line.Length)
    {
        Maximum = line;
    }

    if (Minimum.Length > line.Length)
    {
        Minimum = line;
    }
}

And here is an alternative to this code using LINQ (my approach):

var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Maximum = lines.OrderByDescending(a => a.Length).First().ToString();
var Minimum = lines.OrderBy(a => a.Length).First().ToString();

LINQ is easy to read and implement.

I want to know which one is better for performance, and how LINQ works internally in OrderByDescending and OrderBy when ordering by length.

sangram parmar
    "which one is good for performance?" 1. Don't micro-optimise. Unless your app is running too slowly, you should not be worrying about performance issues like this. 2. If you really really are worried, test it for yourself. – David Arno Jun 25 '15 at 07:12
  • I want to know how LINQ actually works for the OrderBy functionality. You can see the comments on my answer (http://stackoverflow.com/a/31042828/1660178); that is why I'm asking. – sangram parmar Jun 25 '15 at 07:14
  • You should think about it... If you have done a little CS, you'll know that the first algorithm is `O(n)` in time and `O(1)` in space, with `n` being the length of the collection. There are no generic orderings faster than `O(n log n)`. And you are doing it twice! – xanatos Jun 25 '15 at 07:15
  • @DavidArno I always thought that that line of thought is the one that produced Windows 7... And leads to laziness – xanatos Jun 25 '15 at 07:16
  • Your approach is the most efficient way. It needs only one enumeration for both. – Tim Schmelter Jun 25 '15 at 07:18
  • @xanatos, Vista, I think. With Win 7, they attached the profiler and optimised the bad spots :) – David Arno Jun 25 '15 at 07:19
  • @TimSchmelter *His* approach is the LINQ one... – Rawling Jun 25 '15 at 07:19
  • Small observation: `lines` is of type `IEnumerable`, so the `ToString()` calls are unnecessary. – ycsun Jun 25 '15 at 07:19
  • @DavidArno Yep... I meant Vista :-) – xanatos Jun 25 '15 at 07:23
  • @ycsun: yes, but `String.ToString` returns `this`, so it doesn't matter in terms of performance. – Tim Schmelter Jun 25 '15 at 07:25
  • For an O(n) solution with LINQ, look at [Aggregate](http://stackoverflow.com/questions/914109/how-to-use-linq-to-select-object-with-minimum-or-maximum-property-value/914155#914155) – Dmitry Pavliv Jun 25 '15 at 07:27
  • Why sort at all? `lines.Max(l => l.Length)` – knittl Jun 25 '15 at 12:53
  • @knittl Because `lines.Max(l => l.Length)` won't return the longest string, but an integer representing the length of the longest string. – sloth Jun 25 '15 at 13:23
  • @sangramparmar are you sure? – Anant Dabhi Jul 13 '15 at 06:20

4 Answers


You can read the source code for OrderBy.

Stop micro-optimizing or prematurely optimizing your code. Write code that performs correctly; if you face a performance problem later, profile your application and see where the problem is. If a piece of code has a performance problem because it is finding the shortest and longest strings, then start optimizing that part.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3% - Donald Knuth

File.ReadLines returns an IEnumerable<string>, which means that if you foreach over it, it yields the lines to you one by one. I think the best performance improvement you can make here is to improve the reading of the file from disk. If the file is small enough to load whole into memory, use File.ReadAllLines; if it is not, try reading the file in big chunks that fit in memory. Reading a file line by line causes performance degradation because of the I/O operations on disk. So the problem here is not how LINQ or the loop performs; the problem is the number of disk reads.
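To illustrate, here is a minimal single-pass sketch (reusing the path from the question) that keeps the streaming behaviour of File.ReadLines and finds both extremes in one enumeration:

using System;
using System.IO;

class SinglePassDemo
{
    static void Main()
    {
        string shortest = null, longest = null;

        // One enumeration: ReadLines streams the file lazily, line by line.
        foreach (string line in File.ReadLines(@"C:\test.txt"))
        {
            if (longest == null || line.Length > longest.Length)
                longest = line;
            if (shortest == null || line.Length < shortest.Length)
                shortest = line;
        }

        Console.WriteLine("Shortest: " + shortest);
        Console.WriteLine("Longest: " + longest);
    }
}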

Hamid Pourjam
  • `ReadLines` might use "big chunks" internally. It does not necessarily make an I/O operation for each `.MoveNext()` or `.Current` invocation on the enumerator. – Jeppe Stig Nielsen Jun 25 '15 at 07:56
  • @JeppeStigNielsen: `File.ReadLines` is just using a `StreamReader` and its `ReadLine` method internally. http://referencesource.microsoft.com/#mscorlib/system/io/ReadLinesIterator.cs,e7d359a7224464af,references – Tim Schmelter Jun 25 '15 at 08:21
  • `ReadLine` uses [ReadBuffer](http://referencesource.microsoft.com/#mscorlib/system/io/streamreader.cs,ef2abdf7bd65b2ec) internally, which may read chunks of data from disk. @TimSchmelter – Hamid Pourjam Jun 25 '15 at 08:26

In my opinion, you need to understand a few points to decide which way is best.

First, let's suppose that we want to solve the problem with LINQ. Then, to write the most optimized code, you must understand deferred execution. Most LINQ methods, such as Select, Where, OrderBy, Skip, Take and some others, use it. So, what is deferred execution? It means that these methods are not executed until the results are actually needed: they just create an iterator, ready to be executed when we need it. So, how does the user make them execute? With the help of foreach, which calls GetEnumerator, or with other LINQ methods such as ToList(), First(), FirstOrDefault(), Max() and some others.
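A small sketch of what this means in practice (the collection and names are just for illustration):

using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static void Main()
    {
        var numbers = new List<int> { 1, 2, 3 };

        // Where only builds an iterator here; the lambda has not run yet.
        IEnumerable<int> query = numbers.Where(n => n > 1);

        numbers.Add(4); // The query will still see this element.

        // Execution happens now, when foreach calls GetEnumerator.
        foreach (int n in query)
            Console.WriteLine(n); // prints 2, 3, 4
    }
}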

This behaviour helps us gain some performance.
Now, let's come back to your problem. File.ReadLines returns an IEnumerable<string>, which means that it will not read the lines until we need them. In your example, you call a sorting method on this object twice, which means the collection is sorted from scratch twice. Instead, you can sort the collection once, call ToList() to execute the OrderedEnumerable iterator, and then take the first and last elements of the list that is now physically in our hands.

var orderedList = lines
   .OrderBy(a => a.Length) // This method uses deferred execution, so it is not executed yet
   .ToList(); // But ToList() makes it execute.

var Maximum = orderedList.Last();
var Minimum = orderedList.First();

By the way, you can find the OrderBy source code here.

It returns an OrderedEnumerable instance, and the sorting algorithm lives in its GetEnumerator:

public IEnumerator<TElement> GetEnumerator() 
{
    Buffer<TElement> buffer = new Buffer<TElement>(source);
    if (buffer.count > 0) 
    {
        EnumerableSorter<TElement> sorter = GetEnumerableSorter(null);
        int[] map = sorter.Sort(buffer.items, buffer.count);
        sorter = null;
        for (int i = 0; i < buffer.count; i++) yield return buffer.items[map[i]];
    }
}

Now, let's come back to another aspect which affects performance. As you can see, LINQ uses an extra buffer to store the sorted collection. Of course, it takes some memory, which tells us it is not the most efficient way.

I have just tried to explain how LINQ works. But I very much agree with @Dotctor's overall answer. Just don't forget that you can use File.ReadAllLines, which returns not an IEnumerable<string> but a string[]. What does that mean? As I tried to explain at the beginning, the difference is that with an IEnumerable, .NET reads the lines one by one as the enumerator enumerates over the iterator; with a string[], all lines are already in our application's memory.
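A short sketch of that difference (again reusing the path from the question):

using System.Collections.Generic;
using System.IO;

class ReadDemo
{
    static void Main()
    {
        // Lazy: the file is opened on enumeration and read line by line.
        IEnumerable<string> lazy = File.ReadLines(@"C:\test.txt");

        // Eager: the whole file is read into memory right here.
        string[] eager = File.ReadAllLines(@"C:\test.txt");

        // Enumerating 'lazy' twice reads the file from disk twice;
        // enumerating 'eager' twice only walks the in-memory array.
    }
}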

Farhad Jabiyev
  • Also, `O(n log n)` isn't any better than `O(2 n log n)` in the long run. – Rawling Jun 25 '15 at 07:17
  • @Rawling, your comment makes no sense to me (though the upvotes suggest others get it). Sorting once will (on average) take half as long as sorting twice, so how can you claim that they are no different? – David Arno Jun 25 '15 at 07:23
  • @DavidArno Yes, sorting twice takes twice as long as sorting once, but it's pointless to make *that* optimization when you don't need to sort at all. There's no such thing as `O(2 n log n)`, so there's no point switching from `O(n log n)` to `O(n log n)` when it's possible to switch to `O(n)`. – Rawling Jun 25 '15 at 07:27
  • It will actually be 2·O(n log n), not O(2 n log n). But anyway, sorting here is pointless. – Yura Jun 25 '15 at 07:28
  • 2·O(n log n) = O(2 n log n) = O(n log n) @Yura – Hamid Pourjam Jun 25 '15 at 07:29
  • He isn't only sorting the file twice... He is **reading** and sorting the file twice. `ReadLines` returns an `IEnumerable` – xanatos Jun 25 '15 at 07:31
  • In terms of complexity, yes, you are right: it is O(n log n). But in terms of time it is definitely twice as long, which is more relevant for performance. @doctor – Yura Jun 25 '15 at 07:37

With the second method, you are not only sorting the lines twice... You are reading the file twice. This is because File.ReadLines returns an IEnumerable<string>. It clearly shows why you shouldn't ever enumerate an IEnumerable<> twice unless you know how it was built. If you really want to do it, add a .ToList() or a .ToArray() that will materialize the IEnumerable<> into a collection.

And while the first method has a memory footprint of a single line of text (because it reads the file one line at a time), the second method loads the whole file into memory to sort it, so it has a much bigger memory footprint. If the file is some hundred MB, the difference is big. (Note that technically you could have a file containing a single 1 GB line of text, so this rule isn't absolute... It holds for reasonable files whose lines are at most a few hundred characters long :-) )
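A sketch of the materialize-then-query idea (the path and variable names are illustrative):

using System.IO;
using System.Linq;

class MaterializeDemo
{
    static void Main()
    {
        // ToList() enumerates the file exactly once into memory...
        var lines = File.ReadLines(@"C:\test.txt").ToList();

        // ...so these two queries no longer touch the disk.
        string longest = lines.OrderByDescending(l => l.Length).First();
        string shortest = lines.OrderBy(l => l.Length).First();
    }
}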

Now... Someone will tell you that premature optimization is evil, but I'll tell you that ignorance is twice as evil.

If you know the difference between the two blocks of code, then you can make an informed choice between them... Otherwise you are simply randomly throwing rocks until it seems to work. And "seems to work" is the keyword here.

xanatos
  • "you shouldn't ever ever enumerate a IEnumerable<> twice" strongly disagree. There are many cases where eagerly evaluating the whole contents of an enumerable is stupid and wastes ridiculous amounts of memory. – ahruss Jun 26 '15 at 05:59
  • @ahruss The complete sentence should be "you shouldn't ever enumerate an `IEnumerable<>` twice **unless you know how it was built**" (for example, if `a is T[]` is true, then it is an array, so it is OK to enumerate it multiple times). There is no guarantee in the `IEnumerable<>` contract that the data can be read twice, or that the same data will be read if enumerated twice. Clearly even doing a `ToList()`/`ToArray()` introduces an unknown... What if the `IEnumerable<>` is bigger than memory? But from experience this case is rarer than "the `IEnumerable<>` is very slow to generate/read". – xanatos Jun 26 '15 at 06:23
  • Very much appreciated your "premature optimization" vs. "ignorance" comment. I feel that the whole "don't optimize prematurely" bit is too often thrown around as an excuse not to understand how things work. As we get built-in functions that accomplish more and more with a more succinct syntax, it becomes very easy to fall into the trap of thinking that fewer lines of code == better performance. Entity Framework is a perfect example where falling into this trap can be fatal. – sstan Jul 01 '15 at 02:06

The most efficient approach is to avoid LINQ here; the approach using foreach needs only one enumeration.
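For comparison, a sketch of a single-enumeration LINQ alternative using Aggregate, along the lines of the approach linked in the comments above (the anonymous accumulator is just one way to carry both extremes):

using System.IO;
using System.Linq;

class AggregateDemo
{
    static void Main()
    {
        // One enumeration: fold over the lines, carrying both extremes along.
        var extremes = File.ReadLines(@"C:\test.txt").Aggregate(
            new { Min = (string)null, Max = (string)null },
            (acc, line) => new
            {
                Min = acc.Min == null || line.Length < acc.Min.Length ? line : acc.Min,
                Max = acc.Max == null || line.Length > acc.Max.Length ? line : acc.Max
            });

        string shortest = extremes.Min;
        string longest = extremes.Max;
    }
}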

If you want to put the whole file into a collection anyway, you could use this:

List<string> orderedLines = System.IO.File.ReadLines(@"C:\test.txt")
    .OrderBy(l => l.Length)
    .ToList();
string shortest = orderedLines.First();
string longest  = orderedLines.Last();

Apart from that, you should read about LINQ's deferred execution.

Also note that your LINQ approach does not only order all lines twice to get the longest and the shortest, it also needs to read the whole file twice, since File.ReadLines is using a StreamReader (as opposed to ReadAllLines, which reads all lines into an array first).

MSDN:

When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array

In general that can help to make your LINQ queries more efficient, e.g. if you filter out lines with Where, but in this case it makes things worse.

As Jeppe Stig Nielsen has mentioned in a comment, since OrderBy needs to create another buffer collection internally (and with ToList a second one), there is another approach that might be more efficient:

string[] allLines = System.IO.File.ReadAllLines(@"C:\test.txt"); 
Array.Sort(allLines, (x, y) => x.Length.CompareTo(y.Length));
string shortest = allLines.First();
string longest  = allLines.Last();

The only drawback of Array.Sort is that it performs an unstable sort, as opposed to OrderBy. So if two lines have the same length, their original order might not be maintained.

Tim Schmelter
  • My approach is the LINQ one... that is why I'm asking – sangram parmar Jun 25 '15 at 07:29
  • @sangramparmar: so the best in terms of performance is the loop. – Tim Schmelter Jun 25 '15 at 07:29
  • If he wants it in a collection, he could first do `string[] allLines = System.IO.File.ReadAllLines(@"C:\test.txt"); /* note 'All' */`. That is a collection. Then `Array.Sort(allLines, (x, y) => x.Length.CompareTo(y.Length));` does a quicker sort. – Jeppe Stig Nielsen Jun 25 '15 at 07:37
  • @JeppeStigNielsen: isn't that micro-optimization? Is `OrderBy` really slower than `Array.Sort`? One advantage of `OrderBy` is that it maintains the order if two lines have the same length whereas `Array.Sort` is unstable. – Tim Schmelter Jun 25 '15 at 07:39
  • I think `OrderBy` must be slower. Maybe it copies everything to a new collection and then does an in-place sort there? More likely it gradually inserts the new items that are yielded by the source into a new (private) sorted list of some type. But slower does not mean "slow", so it might very well be irrelevant in practice. Your comment about `Sort` being unstable is relevant! We do not know if the order of lines with equal lengths matters. – Jeppe Stig Nielsen Jun 25 '15 at 07:50
  • @JeppeStigNielsen: thanks for the comment, I've put it into my answer :) – Tim Schmelter Jun 25 '15 at 08:02
  • @sangramparmar: note that I've edited my answer considerably. – Tim Schmelter Jun 25 '15 at 08:08