4

I'm working with a rather large set of strings I have to process as quickly as possible.

The format is quite fixed:

[name]/[type]:[list ([key] = [value],)]

or

 [name]/[type]:[key]

I hope my representation is okay. What this means is that I have a word (in my case I call it Name), then a slash, followed by another word (I call it Type), then a colon, and it is either followed by a comma-separated list of key-value pairs (key = value), or a single key.

Name, Type cannot contain any whitespaces, however the key and value fields can.

Currently I'm using Regex to parse this data, and a split:

var regex = @"(\w+)\/(\w+):(.*)";
var r = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
var m = r.Match(Id);
if (m.Success) {
    Name = m.Groups[1].Value;
    Type= m.Groups[2].Value;

    foreach (var intern in m.Groups[3].Value.Split(','))
    {
        var split = intern.Trim().Split('=');
        if (split.Length == 2)
            Items.Add(split[0], split[1]);
        else if (split.Length == 1)
            Items.Add(split[0], split[0]);
    }
}

Now I know this is not the most optional case, but I'm not sure which would be the fastest:

  • Split the string first by the : then by / for the first element, and , for the second, then process the latter list and split again by =
  • Use the current mixture as it is
  • Use a completely regex-based

Of course I'm open to suggestions, my main goal is to achieve the fastest processing of this single string.

fonix232
  • 2,132
  • 6
  • 39
  • 69
  • If the keys or values can have `:` or `/`, you should stick to the regex. – Wiktor Stribiżew Dec 16 '16 at 12:57
  • 3
    _"my main goal is to achieve the fastest processing of this single string"_ - so **measure** the different approaches. You can use the Stopwatch class for this: best off to do it in Release configuration, and to run the tests multiple times and use the average result. – stuartd Dec 16 '16 at 12:57
  • 1
    regex is usually not an optimal solution as it's built for general use cases (generic code is usually not optimal). You may need to consider writing your own code to do this – Khanh TO Dec 16 '16 at 12:57
  • Do you actually have performance issues with the current code ? Do you have profiled and found that it's that specific part which is in cause ? – Sehnsucht Dec 16 '16 at 12:57
  • 1
    Possible duplicate of [String.Split VS. Regex.Split?](http://stackoverflow.com/questions/3601465/string-split-vs-regex-split) – Manfred Radlwimmer Dec 16 '16 at 12:58
  • Split in general should be faster. Deciding what to use will come down to what restrictions everything has. If : or / can be in Name/Type or commas in Key/Value you should use regex – Mats391 Dec 16 '16 at 12:59
  • 3
    Using a bunch of `IndexOf` statements might turn out to be even faster, as no arrays are created. The added complexity might not be worth the small performance gain, though. – C.Evenhuis Dec 16 '16 at 13:01
  • I had the chance to remove two other, similar regexes from the code that were being used in place of simple splitting, and that sped up total code execution by 50% (these come from a file, and are only parts of the file. I'm doing quite heavy XML parsing and even with that, removing ONE single regex and adding three splits sped things up by 50%). – fonix232 Dec 16 '16 at 13:01
  • If you do use only `Split` make sure to use the overload that lets you specify the maximum number of items to return when splitting on the slash and colon, that way you don't have to worry about it splitting on them in the keys or the values if they can contain slashes or colons. If you want to stick with regex I suggest you include `RegexOptions.Compiled` to sped it up. – juharr Dec 16 '16 at 13:02
  • you may write just 1 loop to go though all the characters and collect the words along the way, it should be the quickest – Khanh TO Dec 16 '16 at 13:02
  • Keys cannot contain `:` or `/`, at least not in the 50 million or so samples I've processed so far. – fonix232 Dec 16 '16 at 13:02
  • Where is this data coming from? It looks similar to JSON. – Jake Steffen Dec 16 '16 at 13:32
  • I'm not at liberty to disclose the source of data. – fonix232 Dec 16 '16 at 16:22

1 Answers1

1

Its always fun to implement a custom parser. Obviously concerning code maintenance, Regex is probably the best choice, but if performance is an ultimate concern then you probably need a tailor made parser which even in the simplest syntaxes is quite a lot more of work.

I've whipped one up really quick (it might be a little hackish in some places) to see what it would take to implement one with some basic error recovery and information. This isn't tested in any way but I'd be curious, if its minimally functional, to know how well it stacks up with the Regex solution in terms of performance.

public class ParserOutput
{
    public string Name { get; }
    public string Type { get; }
    public IEnumerable<Tuple<string, string>> KeyValuePairs { get; }
    public bool ContainsKeyValuePairs { get; }
    public bool HasErrors { get; }
    public IEnumerable<string> ErrorDescriptions { get; }

    public ParserOutput(string name, string type, IEnumerable<Tuple<string, string>> keyValuePairs, IEnumerable<string> errorDescriptions)
    {
        Name = name;
        Type = type;
        KeyValuePairs = keyValuePairs;
        ContainsKeyValuePairs = keyValuePairs.FirstOrDefault()?.Item2?.Length > 0;
        ErrorDescriptions = errorDescriptions;
        HasErrors = errorDescriptions.Any();
    }
}

public class CustomParser
{
    private const char forwardSlash = '/';
    private const char colon = ':';
    private const char space = ' ';
    private const char equals = '=';
    private const char comma = ',';

    StringBuilder buffer = new StringBuilder();

    public ParserOutput Parse(string input)
    {
        var diagnosticsBag = new Queue<string>();

        using (var enumerator = input.GetEnumerator())
        {
            var name = ParseToken(enumerator, forwardSlash, diagnosticsBag);
            var type = ParseToken(enumerator, colon, diagnosticsBag);
            var keyValuePairs = ParseListOrKey(enumerator, diagnosticsBag);

            if (name.Length == 0)
            {
                diagnosticsBag.Enqueue("Input has incorrect format. Name could not be parsed.");
            }

            if (type.Length == 0)
            {
                diagnosticsBag.Enqueue("Input has incorrect format. Type could not be parsed.");
            }

            if (!keyValuePairs.Any() ||
                input.Last() == comma /*trailing comma is error?*/)
            {
                diagnosticsBag.Enqueue("Input has incorrect format. Key / Value pairs could not be parsed.");
            }

            return new ParserOutput(name, type, keyValuePairs, diagnosticsBag);
        }
    }

    private string ParseToken(IEnumerator<char> enumerator, char separator, Queue<string> diagnosticsBag)
    {
        buffer.Clear();
        var allowWhitespaces = separator != forwardSlash && separator != colon;

        while (enumerator.MoveNext())
        {
            if (enumerator.Current == space && !allowWhitespaces)
            {
                diagnosticsBag.Enqueue($"Input has incorrect format. {(separator == forwardSlash ? "Name" : "Type")} cannot contain whitespaces.");
            }
            else if (enumerator.Current != separator)
            {
                buffer.Append(enumerator.Current);
            }
            else
                return buffer.ToString();
        }

        return buffer.ToString();
    }

    private IEnumerable<Tuple<string, string>> ParseListOrKey(IEnumerator<char> enumerator, Queue<string> diagnosticsBag)
    {
        buffer.Clear();
        var isList = false;

        while (true)
        {
            var key = ParseToken(enumerator, equals, diagnosticsBag);
            var value = ParseToken(enumerator, comma, diagnosticsBag);

            if (key.Length == 0)
                break;

            yield return new Tuple<string, string>(key, value);

            if (!isList && value.Length != 0)
            {
                isList = true;
            }
            else if (isList && value.Length == 0)
            {
                diagnosticsBag.Enqueue($"Input has incorrect format: malformed [key / value] list.");
            }
        }
    }
}
InBetween
  • 32,319
  • 3
  • 50
  • 90