1

SO I tried to make an unsafe fast method to count lines. I previously used StringReader, but wanted to see if I could make something faster.

So is this code valid, it seems to work but it looks a bit confusing,

and I am very new to C# pointers so I might be doing something bad.

Original Method:

//Return number of (non Empty) lines
private static int getLineCount(string input)
{
    int lines = 0;
    string line = null;

    //Don't count Empty lines
    using (StringReader reader = new StringReader(input))
        while ((line = reader.ReadLine()) != null)
            if (!string.IsNullOrWhiteSpace(line))
                lines++;

    return lines;
}

Unsafe Method:

//Return number of (non Empty) lines (fast method using pointers)
private unsafe static int getLineCountUnsafe(string input)
{
    int lines = 0;

    fixed (char* strptr = input)
    {

        char* charptr = strptr;
        int length = input.Length;
        //Don't count Empty lines
        for (int i = 0; i < length; i++)
        {
            char c = *charptr;
            //If char is an empty line, look if it's empty
            if (c == '\n' || c == '\r')
            {
                //If char is empty, continue till it's not
                while (c == '\n' || c == '\r')
                {
                    if (i >= length)
                        return lines;
                    i++;
                    charptr++;
                    c = *charptr;
                }
                //Add a line when line is not just a new line (empty)
                lines++;
            }
            charptr++;
        }

        return lines;
    }
}

Benchmark:

(Looped through 100000, 10 times)
Total Milliseconds used.

Safe(Original) - AVG = 770.10334, MIN = 765.678, MAX = 778.0017 , TOTAL 07.701
Unsafe - AVG = 406.91843, MIN = 405.7931, MAX = 408.5505 , TOTAL 04.069

EDIT:

It seems that the Unsafe version isn't always correct, if it's one line it won't count it, been trying to solve it without making it count too many;(

Zerowalker
  • 761
  • 2
  • 8
  • 27
  • 2
    Did your *complicated* method actually makes your operation faster? – Niyoko Nov 15 '16 at 13:20
  • In my tests, yeah, from about "700ms" to about "300-400ms" i think. – Zerowalker Nov 15 '16 at 13:22
  • 1
    Describe *valid*? If you asking does it work, then shouldn't you test it? – Liam Nov 15 '16 at 13:22
  • 1
    This should probably be moved to [Code Review](http://codereview.stackexchange.com). Also, I think this code requires the entire file to be loaded into memory, which is okay if one wants to explicitly count lines in a string, but sub-optimal if we're talking about huge files. (edited to add the last sentence) – s.m. Nov 15 '16 at 13:22
  • what i mean with Valid is, it doesn't do anything "illegal" or what not? I mean, i don't know much of unsafe code, so this is just something i messed around to make. I don't show it to be reviewed per say, as the code Might be broken. In my tests it seems to give the same results as the original code, so i can just assume it works as intended. Would just like ppl who know how pointers work to say if i am doing anything bad etc. Sorry for being unclear. – Zerowalker Nov 15 '16 at 13:25
  • *"700ms" to about "300-400ms" i think* ... don't think, use a stopwatch or the visual studio build in timer. Code reviews belong to another website as already mentioned by s.m. – Jim Nov 15 '16 at 13:27
  • Do you actually need this micro optimization? – L-Four Nov 15 '16 at 13:30
  • 1
    @Jim , i meant as in (don't remember). I posted the benchmark. L-Four , not really, it's more for learning pointers and having some fun:) – Zerowalker Nov 15 '16 at 13:35
  • Ok fair enough :) – L-Four Nov 15 '16 at 13:36

1 Answers1

3

Your second implementation seems okay, but don't bother too much with learning unsafe, it is not so widely used in C#, neither pointers. This is getting close to C++. The time difference between the both approaches might come from the avoiding garbage collector to collect any memory inside the method until it is done (because of the fixed keyword).

The reason why one should rarely use unsafe is because C# provides much readability and ease of use within it's already defined methods, like in your case:

//Return number of (non Empty) lines
private static int getLineCount(string input)
{
     return Regex.Matches(input, Environment.NewLine).Count;
}

which may be even faster because of the evaluating at once of the entire string.

meJustAndrew
  • 6,011
  • 8
  • 50
  • 76
  • `new StringReader(input)` creates a reader from the string, then `reader.ReadToEnd()` gives the string back. Seems very unnecessary. – Magnus Nov 15 '16 at 15:29
  • @Magnus sir, you are perfectly right, I have updated my answer. – meJustAndrew Nov 15 '16 at 15:52
  • 1
    For what it's worth i did a test. Regex - AVG = 20.9546, MIN = 20.6002, MAX = 21.6283 , TOTAL 00.210 Unsafe - AVG = 1.61361, MIN = 1.589, MAX = 1.6748 , TOTAL 00.016 – Zerowalker Nov 15 '16 at 16:00
  • @Zerowalker How about: `foreach(var c in input) if(c == '\n') lines++; `? – Magnus Nov 15 '16 at 16:04
  • @Zerowalker have you tested them both in Release configuration without debugger attached? This sounds like a big difference, but again, it is a rare case, and for a human, pretty rare to feel this difference in a normal execution of a program. The simple advantage of this kind of *safe* program is its maintainability which is very important compared with a few miliseconds . – meJustAndrew Nov 15 '16 at 16:10
  • @Magnus I don't find this much better, as long as I could have used : `str.Count(x => x == '\n');` but `'\n'` is not supposed to be used to identify lines, because they could have different identifiers such as: `'\r'` `'\n'` `'\r\n'` as [here](http://stackoverflow.com/questions/1552749/difference-between-cr-lf-lf-and-cr-line-break-types). – meJustAndrew Nov 15 '16 at 16:48
  • Release mode was used, pretty sure of it. The difference is Huge when comparing pointer methods with Regex, tried some others and Regex is Really Slow (though as you said, not within human range). As for '\n' etc, yeah it sucks, the only way i thought of was to just treat either \r and \n as a new line, cause either of them comes at a new line so it should be safe i guess. – Zerowalker Nov 15 '16 at 19:34
  • @Zerowalker I am glad I could provide some useful information, about how safe you can get comparing with only \n or \r will work depending on your system, this is why I have used `Environment.NewLine`. Also, I know that regex has performance lacks. One should consider when readability shoud be chosen insted of performance and vice versa :) – meJustAndrew Nov 15 '16 at 21:24
  • @meJustAndrew , does't Enviroment.NewLine == "\r\n" ? And yeah i completely agree, readability is much to be considered, especially doing non critical work. This work is non critical, but more experimental, It's meant for learning purposes and stuff, so readability in this special case is not of much importance, though of course suggestions will add to the knowledge base anyhow:) – Zerowalker Nov 15 '16 at 23:50
  • @Zerowalker actually, it's half true, if you run it on Windows, as you just did, it is "\r\n" but if you want to run it on different platforms it will be either "\r" or "\n", according to msdn, and revealed by [this simple question](http://stackoverflow.com/questions/1015766/difference-between-n-and-environment-newline). This is why I have used it, within a regex, and did not choose to use the `Count` fom LINQ (see my comments above). – meJustAndrew Nov 16 '16 at 05:57
  • @Zerowalker just as an advice, always at a benchmark it is supposed to take into consideration the lowest values bcause for that values you had the less other processes running on the same computer which slow down the processor (so when ou got the less values, they were the closest to the actual real values of processing time). But this is just an advice. – meJustAndrew Nov 16 '16 at 06:37
  • @meJustAndrew , ah so Min is what i should preferably look at, got it:) I noticed however that using pointers the benchmark is hugely affected by Any CPU vs x64, usually if a fast non-unsafe standard method gets say 10ms, and an unsafe gets 15ms, when you switch to x64 they switch. It's really weird, not really sure what i should make of it, but as i stick with Any CPU as it seems most optimized most of the time i go with those results. – Zerowalker Nov 17 '16 at 20:54
  • I'm *so* going to have to disagree with you on the `unsafe` and pointers stuff... I use them plenty regularly. You shouldn't use them all the time, sure; but there's absolutely nothing wrong with knowing about those tools. – Marc Gravell Nov 17 '16 at 23:10