2

I have this huge json file neatly formated starting with the characters "[\r\n" and ending with "]". I have this piece of code:

foreach (var line in File.ReadLines(@"d:\wikipedia\wikipedia.json").Skip(1))
{
  if (line[0] == ']') break;
  // Do stuff
}

I'm wondering, what would be best performance-wise, what machine code would be the most optimal in regards to how many clock cycles and memory is consumed if I were to compare the above code to one where I have replaced "break" with "continue", or would both of those pieces of code compile to the same MSIL and machine code? If you know the answer, please explain exactly how you reached your conclusion? I'd really like to know.

EDIT: Before you close this as nonsensical, consider that this code is equivalent to the above code and consider that the c# compiler optimizes when the code path is flat and does not fork in a lot of ways, would all of the following examples generate the same amount of work for the CPU?

IEnumerable<char> text = new[] {'[', 'a', 'b', 'c', ']'};
foreach (var c in text.Skip(1))
{
    if (c == ']') break;
    // Do stuff
}
foreach (var c in text.Skip(1))
{
    if (c == ']') continue;
    // Do stuff
}
foreach (var c in text.Skip(1))
{
    if (c != ']')
    {
        // Do stuff                    
    }
}
foreach (var c in text.Skip(1))
{
    if (c != ']')
    {
        // Do stuff                    
    }
}
foreach (var c in text.Skip(1))
{
    if (c != ']')
    {
        // Do stuff                    
    }
    else
    {
        break;
    }
}

EDIT2: Here's another way of putting it: what's the prettiest way to skip the first and last item in an IEnumerable while still deferring the executing until //Do stuff?

Marcus
  • 967
  • 10
  • 20
  • 4
    Best option is using a JSON-parser instead of doing this self. – MakePeaceGreatAgain Mar 04 '16 at 14:15
  • Are you asking how to parse JSON? Or are you asking how to benchmark code? – mason Mar 04 '16 at 14:21
  • I can't do that, the file is too large, but I'm using Newtonsoft to parse the individual lines. – Marcus Mar 04 '16 at 14:21
  • I do know how to parse JSON thx :), more interested in knowing how and if "continue" instead of "break" would generate a different set of tasks for the CPU to execute, – Marcus Mar 04 '16 at 14:23
  • 4
    What keeps you from compiling both and looking at the MSIL code? – Emond Mar 04 '16 at 14:26
  • @ErnodeWeerd I'd liek a lecture in how to do that pls. – Marcus Mar 04 '16 at 14:27
  • 3
    They should compile to jump instructions either way. `continue` jumps to the beginning of the next loop iteration, while `break` jumps to the exit of the loop. Comparing performance between the two seems nonsensical as they are for different operations... – Glorin Oakenfoot Mar 04 '16 at 14:29
  • @GlorinOakenfoot Wouldn't a jump to the beginning of the next iteration be more work for the CPU compared to exiting a loop? – Marcus Mar 04 '16 at 14:31
  • 2
    @Marcus - The jump itself, no. It's all just some location in memory. You will be doing more work *because you are continuing to iterate through the loop*, however. – Glorin Oakenfoot Mar 04 '16 at 14:32
  • 3
    I find it very hard to give a meaningful answer because continue and break do different things. Clearly, break is often cheaper because it ends the loop. Can you post two equivalent pieces of code that you want compared? – usr Mar 04 '16 at 14:42
  • So your actual problem is how to parse a json file that's too large to fit into memory? Or is parsing taking too long and you're looking to speed it up? Or both? Are you sure you have exhausted all other (higher-level) kinds of optimizations? – Pieter Witvoet Mar 04 '16 at 14:46
  • 1
    You're parsing string data; the difference between a `continue` and a `break` (which are both simply some form of conditional or unconditional branch) is nothing except of course: whether or not you want to run the next loop iteration. It is just "jump past the loop" vs "jump to the check code" – Marc Gravell Mar 04 '16 at 14:51
  • @usr No, in this case "continue" does the same thing as "break". They are interchangeable in my code-examples. They both hinder execution of the code that comes next within the iteration body. – Marcus Mar 04 '16 at 14:52
  • @MarcGravell Yes and since "jump to check code" would have to be followed by "jump past the loop" if I say "continue" then "break" is actually less work. Yes? – Marcus Mar 04 '16 at 14:55
  • 1
    @Marcus if you want to exit a loop, then yes of course `break` is marginally more direct - it is essentially "goto outside the loop" rather than "goto the end of the block and retry" – Marc Gravell Mar 04 '16 at 15:02
  • @MarcGravell I should learn assembly and dissassembly to not have to trust what others say but thanks for confirming that what I have read so far about c#/c/assembly has been understood:) – Marcus Mar 04 '16 at 15:16
  • They're only 'equivalent' if the input contains a single `']'` at the end, and in that case `continue` would cause the `foreach` loop to still try to get the next character, while `break` would skip that step. But you may not be able to guarantee that your input is well-formed, and besides, this only affects the end of the loop - it doesn't speed up the work performed in its body. What matters is what behavior you want: should characters after the `']'` be ignored or should an error be raised? – Pieter Witvoet Mar 04 '16 at 15:18
  • @Marcus In response to your 'lecture' request: looking at the MSIL code is pretty easy. Use (online) .net fiddler or a tool like ILSpy, Reflector, etc. – atlaste Mar 04 '16 at 15:34
  • @PieterWitvoet Good point about well-formed imput. – Marcus Mar 05 '16 at 02:30

1 Answers1

4

Q: Different MSIL for break or continue in loop?

Yes, that's because it works like this:

foreach (var item in foo)
{
    // more code...

    if (...) { continue; } // jump to #1
    if (...) { break; } // jump to #2

    // more code...

    // #1 -- just before the '}'
}

// #2 -- after the exit of the loop.

Q: What will give you the most performance?

Branches are branches for the compiler. If you have a goto, a continue or a break, it will eventually be compiled as a branch (opcode br), which will be analyzes as such. In other words: it doesn't make a difference.

What does make a difference is having predictable patterns of both data and code flow in the code. Branching breaks code flow, so if you want performance, you should avoid irregular branches.

In other words, prefer:

for (int i=0; i<10 && someCondition; ++i)

to:

for (int i=0; i<10; ++i) 
{
    // some code
    if (someCondition) { ... } 
    // some code
}

As always with performance, the best thing to do is to run benchmarks. There's no surrogate.

Q: What will give you the most performance? (#2)

You're doing a lot with IEnumerable's. If you want raw performance and have the option, it's best to use an array or a string. There's no better alternative in terms of raw performance for sequential access of elements.

If an array isn't an option (for example because it doesn't match the access pattern), it's best to use a data structure that best suits the access pattern. Learn about the characteristics of hash tables (Dictionary), red black trees (SortedDictionary) and how List works. Knowledge about how stuff really works is the thing you need. If unsure, test, test and test again.

Q: What will give you the most performance? (#3)

I'd also try JSON libraries if your intent is to parse that. These people probably already invented the wheel for you - if not, it'll give you a baseline "to beat".

Q: [...] what's the prettiest way to skip the first and last item [...]

If the underlying data structure is a string, List or array, I'd simply do this:

for (int i=1; i<str.Length-1; ++i)
{ ... }

To be frank, other data structures don't really make sense here IMO. That said, people somethings like to put Linq code everywhere, so...

Using an enumerator

You can easily make a method that returns all but the first and last element. In my book, enumerators always are accessed in code through things like foreach to ensure that IDisposable is called correctly.

public static IEnumerable<T> GetAllButFirstAndLast<T>(IEnumerable<T> myEnum)
{
    T jtem = default(T);
    bool first = true;
    foreach (T item in myEnum.Skip(1)) 
    { 
        if (first) { first = false; } else { yield return jtem; }  
        jtem = item;
    }
}

Note that this has little to do with "getting the best performance out of your code". One look at the IL tells you all you need to know.

atlaste
  • 30,418
  • 3
  • 57
  • 87
  • The main take-away for me was this: What does make a difference is having predictable patterns of both data and code flow in the code. Branching breaks code flow, so if you want performance, you should avoid irregular branches. – Marcus Mar 05 '16 at 02:14
  • But say you have to do something a whole lot of times in a loop and there is a condition you need to check on each iteration, for example "is count less than max size" then would a while loop outperform an if-clause? I'm iterating through 67GB of json, millions of records. I'd like to know I don't branch unneccessarily. – Marcus Mar 05 '16 at 02:26
  • 1
    @Marcus Predictable patterns with less branches (usually the shortest code) is usually better. I wrote a bunch of stuff down here on how your compiler behaves with branches: http://stackoverflow.com/questions/324831/breaking-out-of-a-nested-loop/35755622#35755622 . Note that a while loop is also a branch, just like an if from an IL point of view. – atlaste Mar 05 '16 at 08:31
  • thx that was my concern. No need to make your code less readable then since the compiler optimizes it, seems to be the rule most of the times but to take care of unneccessary branching (e.g conditions within loops) – Marcus Mar 05 '16 at 11:13