Parsing routine for over-sign characters relatively slow

Question

I am trying to make the enclosed subroutine more performant using NET framework 4.6.1 although I will eventually port it to net core 2.2 .

It may run up to 250,000 times when parsing a file.

Using Visual Studio Performance Analyzer I see this routine seems to have a fairly high relative cost in the whole parsing process.

The code is part of a parsing program whose input is a binary file that contains some very old record formats that contain over-signed numbers.

Over-signed Numbers (background)

Instead of a minus sign and in order to save space the last digit is made a letter if the number is negative. Its a very old standard dating back to when computers had limited memory and fixed width records were required for performance.

When parsing I need to convert the last letter back to a number and make the number negative

Some examples of input and output of the routine

00056K = -562

00032N = -325

Current Code (slow)

private int ConvertOverSign(string overSignedString)
{
    switch(overSignedString.Substring(overSignedString.Length -1,1))
    {
        case " ":
            return 0;
        case "J":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "1");
        case "K":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "2");
        case "L":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "3");
        case "M":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "4");
        case "N":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "5");
        case "O":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "6");
        case "P":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "7");
        case "Q":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "8");
        case "R":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) +  "9");
        case "!":
            return -Convert.ToInt32(overSignedString.Substring(0,overSignedString.Length -1) + "0");
        default:
            return Convert.ToInt32(overSignedString);
    }
}

How slow is "slow"? Make sure you don't do micro-optimizations. — CodeCaster, Dec 18 '18 at 15:51
Your code would be shorter and more readable if you saved `overSignedString.Substring(0,overSignedString.Length -1)` to a variable instead of copy/pasting it in every case. You'd reduce the length of your code by 500+ characters (each duplicate of that line is 56 characters long, and its duped 10 times). — , Dec 18 '18 at 15:53
What is the max number of digits for your fixed width values? — BurnsBA, Dec 18 '18 at 16:16
@Amy Wouldn't that do an allocation every time the function was called? — mikes-so, Dec 19 '18 at 08:46
@mikes-so You're doing that anyway every time the function is called - twice. The change I propose would not negatively hurt the efficiency of your code, but it would make it much shorter and more readable. It's called refactoring. — , Dec 19 '18 at 14:17
Not for all case the change you propose would substring even when not needed. — mikes-so, Dec 19 '18 at 18:19

Mario Vernari · Accepted Answer · 2018-12-19T09:00:31.553

3

Not sure the below solution is fully equivalent to yours, but at least should give you a hint on how to make a very fast string-to-number parser.

    private int ConvertOverSign(string overSignedString)
    {
        if (overSignedString == " ") return 0;

        int value = 0;
        for (int i = 0; i < overSignedString.Length; i++)
        {
            char ch = overSignedString[i];
            switch (ch)
            {
                case '0':
                case '1':
                case '2':
                case '3':
                case '4':
                case '5':
                case '6':
                case '7':
                case '8':
                case '9':
                    value = value * 10 + (ch - 0x30);
                    break;

                case '!':
                    value *= 10;
                    return -value;

                case 'J':
                case 'K':
                case 'L':
                case 'M':
                case 'N':
                case 'O':
                case 'P':
                case 'Q':
                case 'R':
                    value = value * 10 + (ch - 'I');
                    return -value;
            }
        }
        return value;
    }

Bear in mind that string manipulations (e.g. Substring) are typically heavy if you need performance.

edited Dec 19 '18 at 09:00

answered Dec 18 '18 at 16:04

Mario Vernari

6,649
1
32
44

Great solution . Need to return the value as negative when char is not a number. – mikes-so Dec 18 '18 at 16:59
Could you fix this for future searches please, Return negative values for numbers that end in a letter. – mikes-so Dec 18 '18 at 23:45
should be correct now. Please, mark this answer as correct. – Mario Vernari Dec 19 '18 at 03:41
The maths is still incorrect for J-R as you are adding the value derived from the letter. Easiest way is to keep your original calc and just return negative result `value = value * 10 + (ch - 'I');` `return -value;` (note the minus sign) – mikes-so Dec 19 '18 at 08:28
Yes makes senseto use the same methodology for both. Strangely this routine seems more CPU intensive than the others. Currently Visual Studio allocation tracking is broken (known issue) but I am presuming this function would be the most GC friendly? – mikes-so Dec 19 '18 at 09:09
This code does not use GC at all: everything (data) are pushed on the stack. The only obvious exception is for the input string, but that's off the code context. If you want to compare your original version and this one, make an app with two loops, each for the code version. Same input, same loop count, just measure time of execution: GC will take longer to execute. – Mario Vernari Dec 19 '18 at 09:55

InBetween · Answer 2 · 2018-12-18T16:05:38.380

2

Switch over the indexed character. Substring is actually alocating a new string and that is slow:

switch (overSignedString[Length - 1])
{
    case ' ':
        return 0;
    case "J":
        return ...

You might want to read this to see if its worth parsing the string inside each case avoiding Convert. There are faster ways.

edited Dec 18 '18 at 16:05

answered Dec 18 '18 at 15:59

InBetween

32,319
3
50
90

Using Indexed character increased the performance from 10 secs down to 7secs.. I also found out this routine potentially iterates over 3 million times . Thank you so much. Will look at the switch statements now. – mikes-so Dec 18 '18 at 16:19

CodeCaster · Answer 3 · 2018-12-18T16:10:17.377

1

Your method is slow because it generates a lot of string garbage.

You could improve it by comparing characters instead of strings, and perform multiplication of the resulting integer instead of appending strings and using a lookup instead of a switch:

private Dictionary<char, int> _additions = new Dictionary<char, int>
{
    { '!', 0 },
    { 'J', 1 },
    { 'K', 2 },
    { 'L', 3 },
    { 'M', 4 },
    { 'N', 5 },
    { 'O', 6 },
    { 'P', 7 },
    { 'Q', 8 },
    { 'R', 9 },
};

private int ConvertOverSign(string overSignedString)
{
    var lastChar = overSignedString[overSignedString.Length -1];

    if (lastChar == ' ')
    {
        return 0;
    }

    if (!_additions.TryGetValue(lastChar, out int addition))
    {
        return Convert.ToInt32(overSignedString);
    }

    var result = (Convert.ToInt32(overSignedString.Substring(0, overSignedString.Length - 1)) * -10) - addition;
    return result;
}

edited Dec 18 '18 at 16:10

answered Dec 18 '18 at 16:04

CodeCaster

147,647
23
218
272

valuable as solution, but dictionaries are heavier and slower than switch. – Mario Vernari Dec 18 '18 at 16:05
1

If that'd be the bottleneck now (can't really spin up a proper benchmark right now), then the OP could translate the dictionary lookup back into a switch. – CodeCaster Dec 18 '18 at 16:06
I told you because I've tried once: a switch is basically a jump table, whereas a dictionary needs a lot of things more to resolve the proper case. – Mario Vernari Dec 18 '18 at 16:08
2

Before Roslyn, if the number of case statements is small, the compiler emits consecutive `if/else` statements. If a large number of sequential cases, a dictionary is used. Roslyn doesn't do this, AFAIK. See https://stackoverflow.com/questions/11617091/in-a-switch-vs-dictionary-for-a-value-of-func-which-is-faster-and-why – Dec 18 '18 at 16:23
@Amy many thanks for the clarification. Did not know the fallback for large sets of cases. – Mario Vernari Dec 18 '18 at 17:11

Parsing routine for over-sign characters relatively slow

3 Answers3