1

I have a comma separated string that I would like to convert into a class. My class definition is like so:

public class Data 
{
    public string Event {get;set;}
    public string TagId {get;set;}
    public string Type {get;set;}
    public string Frequency {get;set;}
    public string Rssi {get;set;}
    public string TxPower {get;set;}
    public string Tid {get;set;}
}

and the string that I get is like so:

event.tag.report tag_id=0x534D43010005600803251100, type=ISOC, antenna=1, frequency=919000, rssi=-451, tx_power=280, tid=0xE2003412012DF30009DA43851F0E0074300541FBFFFFDC50

I can split the string on the commas, split each piece again on the equals sign, take the last value, and assign it to the matching property. Is there an approach that is faster and better than this?

ATTEMPT

private void ParseIntoDataClass (string eventInfo)
{
    var firstArray = eventInfo.Split(',');
    var secondArray = new List<string>();
    var data = new Data();
    foreach (var item in firstArray)
    {
       secondArray.Add(item.Split('=').Last());
    }

    for (int i = 0; i < secondArray.Count - 1; i++)
    {
        switch(i)
        {
            case 0:
                data.Event = secondArray[i];
                break;
            case 1:
                data.TagId = secondArray[i];
                break;
            case 2:
                data.Type = secondArray[i];
                break;
            case 3:
                data.Frequency = secondArray[i];
                break;
            case 4:
                data.Rssi = secondArray[i];
                break;
            case 5:
                data.TxPower = secondArray[i];
                break;
            case 6:
                data.Tid = secondArray[i];
                break;
        }
    }
}
Ibanez1408
  • Do you have any control over the input format? – Fildor Mar 02 '23 at 10:18
  • Instead of splitting, I'd try an approach using `Span` if this deserialization is called often and fast. But I'd code both to have a measure of gain running them in benchmarkdotnet against each other. Sometimes results are surprising. – Fildor Mar 02 '23 at 10:20
  • If the fields always are guaranteed to have the same length and position, you could just go with index and length into a `ReadOnlySpan`, but that's a lot to rely on. – Fildor Mar 02 '23 at 10:25
  • I have no control on the string output. – Ibanez1408 Mar 02 '23 at 10:27
  • @Fildor Could you give me sample of how to do what you are suggesting. Please see my edit on how I got to my solution. Maybe you have something better – Ibanez1408 Mar 02 '23 at 10:29
  • That's *not* a comma-separated string. It's a custom format containing key/value pairs separated by commas. It's unclear if the text contains newline or not. Post an actual example in a code block. You need a custom parser for this but the details will depend on the format itself. It's quite possible a simple regular expression will be enough to read keys and values *without* splitting – Panagiotis Kanavos Mar 02 '23 at 10:32
  • would putting everything in a dictionary work for you instead of a class? – Molbac Mar 02 '23 at 10:34
  • So, your code is assuming fixed order of fields. But it's allocating a lot of intermediary strings. I am on the cell phone right now, so it's really not good for giving a full answer, but I'd give regex a shot, actually. – Fildor Mar 02 '23 at 10:34
  • @Molbac it has to be a class. – Ibanez1408 Mar 02 '23 at 10:35
  • You could use `(?<key>\w+?)=(?<value>\w+?)` for example to capture keys/values separated by non-word characters into [named groups](https://learn.microsoft.com/en-us/dotnet/standard/base-types/grouping-constructs-in-regular-expressions#named_matched_subexpression) and then extract them by name – Panagiotis Kanavos Mar 02 '23 at 10:36
  • Is `event.tag.report` the value for `event`? – Panagiotis Kanavos Mar 02 '23 at 11:02
  • I think you want to create an object, not a class. – Jodrell Mar 02 '23 at 11:58

4 Answers

2

You can use the regular expression (\w+)\s*=\s*([^,]*) to extract the key/value pairs, like this:

tag_id: 0x534D43010005600803251100
type: ISOC
antenna: 1
frequency: 919000
rssi: -451
tx_power: 280
tid: 0xE2003412012DF30009DA43851F0E0074300541FBFFFFDC50

Full sample:

public static void Main()
{
    string input = "event.tag.report tag_id=0x534D43010005600803251100, type=ISOC, antenna=1, frequency=919000, rssi=-451, tx_power=280, tid=0xE2003412012DF30009DA43851F0E0074300541FBFFFFDC50";

    Dictionary<string, string> parameters = new Dictionary<string, string>();
    Regex regex = new Regex(@"(\w+)\s*=\s*([^,]*)");
    MatchCollection matches = regex.Matches(input);
    foreach (Match match in matches)
    {
        parameters[match.Groups[1].Value] = match.Groups[2].Value;
    }

    // Data
    var data = new Data();
    // data.Event = ?
    data.TagId = parameters["tag_id"];
    data.Type = parameters["type"];
    data.Frequency = parameters["frequency"];
    data.Rssi = parameters["rssi"];
    data.TxPower = parameters["tx_power"];
    data.Tid = parameters["tid"];
}

Updated:

Sorry, I noticed that the Event value does not appear as a key in the string. Note that the ParseIntoDataClass() method from the question assigns the tag_id value to Event.

Updated

I think @VadimMartynov's answer is the best. I benchmarked both versions: the one without property reflection and the one with property reflection.

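|                       Method |        Mean |     Error |    StdDev |
|----------------------------- |------------:|----------:|----------:|
|           ParseIntoDataClass |    196.5 ns |   3.45 ns |   3.23 ns |
| ParseIntoDataClassByProperty | 11,739.1 ns | 224.12 ns | 209.64 ns |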

As @VadimMartynov said, reflection does have a significant negative impact on performance.
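
For readers who have not seen the reflection variant, here is a minimal sketch of what mapping by property reflection might look like. It is not the code that was benchmarked above: the `ReflectedData` class, the `ReflectionMapper` helper, and the snake_case to PascalCase naming rule are all assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text;

// Hypothetical stand-in for the question's Data class.
public sealed class ReflectedData
{
    public string? TagId { get; set; }
    public string? Type { get; set; }
    public string? TxPower { get; set; }
}

public static class ReflectionMapper
{
    // Assumed naming rule: "tx_power" -> "TxPower".
    public static string ToPascalCase(string snake)
    {
        var builder = new StringBuilder(snake.Length);
        bool upperNext = true;
        foreach (char c in snake)
        {
            if (c == '_') { upperNext = true; continue; }
            builder.Append(upperNext ? char.ToUpperInvariant(c) : c);
            upperNext = false;
        }
        return builder.ToString();
    }

    // Looks up a writable string property for every key and sets it.
    // Keys without a matching property (e.g. "antenna") are skipped.
    public static void Apply(IReadOnlyDictionary<string, string> values, object target)
    {
        Type type = target.GetType();
        foreach (KeyValuePair<string, string> pair in values)
        {
            PropertyInfo? property = type.GetProperty(ToPascalCase(pair.Key));
            if (property != null && property.CanWrite && property.PropertyType == typeof(string))
            {
                property.SetValue(target, pair.Value);
            }
        }
    }
}
```

The `GetProperty` lookup and `SetValue` call run once per key on every parsed line, which is the kind of per-call overhead that a direct assignment such as `data.TagId = ...` avoids.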

Antony Kao
  • Dang!!! I AM CLAPPING AND SALUTING! Thank you. – Ibanez1408 Mar 02 '23 at 11:11
  • If you can explain what is happening Sir that would be a lot more awesome! – Ibanez1408 Mar 02 '23 at 11:13
  • string.Split is generally faster than compiled Regex.Matches for splitting a string into substrings based on a delimiter. string.Split is optimized for splitting a string. It uses a simple loop that searches for the delimiter character(s) and creates a new substring for each section of the original string between the delimiters. This means that it can be very efficient for simple delimiters. Regex matching process involves comparing the input string to the regular expression pattern, which can require a significant amount of computation, especially if the input string is long. – Vadim Martynov Mar 02 '23 at 11:22
  • I would do some timings on this to be sure. I put this code and the OP's code into a Benchmark.Net program and the results were that parsing using the OP's code takes 886ns and the regexp version takes 8750ns (so the regexp is around 10 times slower) – Matthew Watson Mar 02 '23 at 11:48
  • I also added a benchmark using BenchmarkDotNet to my answer https://stackoverflow.com/a/75614395/5649561 and it turns out that the code with regular expressions is 10 times slower than the code from the original question and 40 times slower than the code using Span. So I think regular expressions are not the best solution in this particular problem. – Vadim Martynov Mar 02 '23 at 12:09
  • you can also notice that the regular expression version consumes a lot more memory (at least for a single line). Here is code of replit.com/@VadimMartynov/SplitStringBenchmark?v=1 – Vadim Martynov Mar 02 '23 at 12:24
  • That "benchmark" doesn't use splitting at all. You're comparing apples to hammers. And if you used ANTLR or FParsec you'd get better performance *and* be able to read the code – Panagiotis Kanavos Mar 02 '23 at 12:35
  • @PanagiotisKanavos I agree, in an ideal benchmark it would be worth comparing more benchmark independent operations, but first, in my humble opinion, in the current question it is important to compare the performance of the whole algorithm, not specific operations. Here's a benchmark that says that string.Split is still significantly more productive than regular expression https://stackoverflow.com/a/58917981/5649561. Second, the very principle of regular expressions cannot allow them to overtake the Span implementation, which is literally a single pass through the list using a loop. – Vadim Martynov Mar 02 '23 at 12:56
  • @PanagiotisKanavos Did you look at [the source code for the benchmark](https://replit.com/@VadimMartynov/SplitStringBenchmark?v=1#main.cs) ? I see string splitting there. I've performed my own benchmarking, and using regexp is significantly slower, although it uses far less memory. – Matthew Watson Mar 02 '23 at 12:58
  • @MatthewWatson yeah there is a method from original question with string split in this benchmark to compare a regex implementation with baseline. Regex can't be faster by design and memory usage actually depends globally on the way the strings in the original question are handled, read, and stored. In any case, iterating with Span will be the least memory-consuming since it does not create any objects other than the needed ones at all and only goes over the string once. Regular expressions are almost always a terrible performance solution. – Vadim Martynov Mar 02 '23 at 13:05
  • Thank you all for your comments, I also learned a lot! – Antony Kao Mar 03 '23 at 03:54
  • I'm sorry that my REPUTATION is below 50, so I cannot comment on your answers. – Antony Kao Mar 03 '23 at 03:57
2

You can write a method to extract the name/value pairs from the input by returning ranges for the name and value strings, like so:

public static IEnumerable<(Range name, Range value)> ExtractNameValuePairs(string input)
{
    int nameStart = 0;
    int nameEnd   = 0;

    for (int i = 0; i < input.Length; i++)
    {
        switch (input[i])
        {
            case ' ':
                nameStart = i + 1;
                break;

            case '=':
                nameEnd = i;
                break;

            case ',':
                yield return (new Range(nameStart, nameEnd), new Range(nameEnd + 1, i));
                break;
        }
    }

    yield return (new Range(nameStart, nameEnd), new Range(nameEnd + 1, input.Length));
}

You can then use that to parse the data into the correct parameters like so:

static void ParseUsingRangesAndCheckingNames(string input)
{
    var data = new Data(); // data.Event is what?
    
    foreach (var nvp in ExtractNameValuePairs(input))
    {
        switch (input[nvp.name])
        {
            case "tag_id"   : data.TagId     = input[nvp.value]; break;
            case "type"     : data.Type      = input[nvp.value]; break;
            case "frequency": data.Frequency = input[nvp.value]; break;
            case "rssi"     : data.Rssi      = input[nvp.value]; break;
            case "tx_power" : data.TxPower   = input[nvp.value]; break;
            case "tid"      : data.Tid       = input[nvp.value]; break;
        }
    }
}

If you know that the data will always be in the order that you specified and have all the name/value pairs from your example you can optimise that a bit to:

static void ParseUsingRanges(string input)
{
    var data = new Data(); // data.Event is what?

    using var iter = ExtractNameValuePairs(input).GetEnumerator();

    iter.MoveNext();
    data.TagId = input[iter.Current.value];
    
    iter.MoveNext();
    data.Type = input[iter.Current.value];
    
    iter.MoveNext(); // Skip antenna
    iter.MoveNext();
    data.Frequency = input[iter.Current.value];

    iter.MoveNext();
    data.Rssi = input[iter.Current.value];

    iter.MoveNext();
    data.TxPower = input[iter.Current.value];

    iter.MoveNext();
    data.Tid = input[iter.Current.value];
}

but obviously that's a lot more brittle because it assumes the order and the presence of the various name/value pairs.
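
The comments under the question suggest `Span`. As a minimal sketch of that idea (not the code benchmarked in this thread), the name-checked parser can also be written with `ReadOnlySpan<char>` slices so that the only strings allocated are the final property values. The `TagReport` class and `SpanParser` helper are hypothetical names, and treating the text before the first key as the `Event` value is an assumption, since the thread never settles what `Event` should contain.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the question's Data class, so the sketch is self-contained.
public sealed class TagReport
{
    public string? Event { get; set; }
    public string? TagId { get; set; }
    public string? Type { get; set; }
    public string? Frequency { get; set; }
    public string? Rssi { get; set; }
    public string? TxPower { get; set; }
    public string? Tid { get; set; }
}

public static class SpanParser
{
    public static TagReport Parse(string input)
    {
        var report = new TagReport();
        ReadOnlySpan<char> rest = input.AsSpan();

        while (!rest.IsEmpty)
        {
            // Slice off the next "name=value" pair without allocating a substring.
            int comma = rest.IndexOf(',');
            ReadOnlySpan<char> pair = comma < 0 ? rest : rest[..comma];
            rest = comma < 0 ? ReadOnlySpan<char>.Empty : rest[(comma + 1)..];

            int separator = pair.IndexOf('=');
            if (separator < 0) continue; // not a name=value pair

            ReadOnlySpan<char> name = pair[..separator].Trim();
            ReadOnlySpan<char> value = pair[(separator + 1)..].Trim();

            // Assumption: text before the last space in the name is the event prefix.
            int space = name.LastIndexOf(' ');
            if (space >= 0)
            {
                report.Event = name[..space].Trim().ToString();
                name = name[(space + 1)..];
            }

            // Matching on names means order and extra keys (e.g. antenna) do not matter.
            if (name.Equals("tag_id", StringComparison.Ordinal)) report.TagId = value.ToString();
            else if (name.Equals("type", StringComparison.Ordinal)) report.Type = value.ToString();
            else if (name.Equals("frequency", StringComparison.Ordinal)) report.Frequency = value.ToString();
            else if (name.Equals("rssi", StringComparison.Ordinal)) report.Rssi = value.ToString();
            else if (name.Equals("tx_power", StringComparison.Ordinal)) report.TxPower = value.ToString();
            else if (name.Equals("tid", StringComparison.Ordinal)) report.Tid = value.ToString();
        }

        return report;
    }
}
```

Whether this is actually faster than the `Range`-based iterator would need to be measured in the same benchmark; the sketch only shows the shape of the technique.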

In order to benchmark this I used the following code:

using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

namespace Console1;

[MemoryDiagnoser]
public class Benchmarks
{
    [Benchmark]
    public void ParseUsingStringSplit()
    {
        ParseIntoDataClass(data);
    }

    [Benchmark]
    public void ParseUsingRegExp()
    {
        ParseUsingRegExp(data);
    }

    [Benchmark]
    public void ParseUsingRanges()
    {
        ParseUsingRanges(data);
    }

    [Benchmark]
    public void ParseUsingRangesAndCheckingNames()
    {
        ParseUsingRangesAndCheckingNames(data);
    }

    static void ParseIntoDataClass(string eventInfo)
    {
        var firstArray  = eventInfo.Split(',');
        var secondArray = new List<string>();
        var data        = new Data();

        foreach (var item in firstArray)
        {
            secondArray.Add(item.Split('=').Last());
        }

        for (int i = 0; i < secondArray.Count - 1; i++)
        {
            switch (i)
            {
                case 0:
                    data.Event = secondArray[i];
                    break;
                case 1:
                    data.TagId = secondArray[i];
                    break;
                case 2:
                    data.Type = secondArray[i];
                    break;
                case 3:
                    data.Frequency = secondArray[i];
                    break;
                case 4:
                    data.Rssi = secondArray[i];
                    break;
                case 5:
                    data.TxPower = secondArray[i];
                    break;
                case 6:
                    data.Tid = secondArray[i];
                    break;
            }
        }
    }

    static void ParseUsingRegExp(string input)
    {
        Dictionary<string, string> parameters = new Dictionary<string, string>();
        Regex                      regex      = new Regex(@"(\w+)\s*=\s*([^,]*)");
        MatchCollection            matches    = regex.Matches(input);

        foreach (Match match in matches)
        {
            parameters[match.Groups[1].Value] = match.Groups[2].Value;
        }

        // Data
        var data = new Data();
        // data.Event = ?
        data.TagId     = parameters["tag_id"];
        data.Type      = parameters["type"];
        data.Frequency = parameters["frequency"];
        data.Rssi      = parameters["rssi"];
        data.TxPower   = parameters["tx_power"];
        data.Tid       = parameters["tid"];
    }

    static void ParseUsingRanges(string input)
    {
        var data = new Data(); // data.Event is what?

        using var iter = ExtractNameValuePairs(input).GetEnumerator();

        iter.MoveNext();
        data.TagId = input[iter.Current.value];
    
        iter.MoveNext();
        data.Type = input[iter.Current.value];
    
        iter.MoveNext(); // Skip antenna
        iter.MoveNext();
        data.Frequency = input[iter.Current.value];

        iter.MoveNext();
        data.Rssi = input[iter.Current.value];

        iter.MoveNext();
        data.TxPower = input[iter.Current.value];

        iter.MoveNext();
        data.Tid = input[iter.Current.value];
    }

    static void ParseUsingRangesAndCheckingNames(string input)
    {
        var data = new Data(); // data.Event is what?
    
        foreach (var nvp in ExtractNameValuePairs(input))
        {
            switch (input[nvp.name])
            {
                case "tag_id"   : data.TagId     = input[nvp.value]; break;
                case "type"     : data.Type      = input[nvp.value]; break;
                case "frequency": data.Frequency = input[nvp.value]; break;
                case "rssi"     : data.Rssi      = input[nvp.value]; break;
                case "tx_power" : data.TxPower   = input[nvp.value]; break;
                case "tid"      : data.Tid       = input[nvp.value]; break;
            }
        }
    }

    public static IEnumerable<(Range name, Range value)> ExtractNameValuePairs(string input)
    {
        int nameStart = 0;
        int nameEnd   = 0;

        for (int i = 0; i < input.Length; i++)
        {
            switch (input[i])
            {
                case ' ':
                    nameStart = i + 1;
                    break;

                case '=':
                    nameEnd = i;
                    break;

                case ',':
                    yield return (new Range(nameStart, nameEnd), new Range(nameEnd + 1, i));
                    break;
            }
        }

        yield return (new Range(nameStart, nameEnd), new Range(nameEnd + 1, input.Length));
    }

    readonly string data = "event.tag.report tag_id=0x534D43010005600803251100, type=ISOC, antenna=1, frequency=919000, rssi=-451, tx_power=280, tid=0xE2003412012DF30009DA43851F0E0074300541FBFFFFDC50";
}

public class Data
{
    public string? Event     { get; set; }
    public string? TagId     { get; set; }
    public string? Type      { get; set; }
    public string? Frequency { get; set; }
    public string? Rssi      { get; set; }
    public string? TxPower   { get; set; }
    public string? Tid       { get; set; }
}

With the following results:

|                           Method |       Mean |     Error |    StdDev |     Median |   Gen0 | Allocated |
|--------------------------------- |-----------:|----------:|----------:|-----------:|-------:|----------:|
|            ParseUsingStringSplit |   969.9 ns |  36.98 ns | 106.71 ns |   942.9 ns | 0.4253 |    1784 B |
|                 ParseUsingRegExp | 8,965.3 ns | 179.23 ns | 349.57 ns | 8,867.3 ns | 2.1362 |    8960 B |
|                 ParseUsingRanges |   450.8 ns |   8.92 ns |  13.62 ns |   447.3 ns | 0.1163 |     488 B |
| ParseUsingRangesAndCheckingNames |   575.3 ns |  13.24 ns |  38.19 ns |   564.2 ns | 0.1774 |     744 B |
Matthew Watson
0

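I would do something like this:

private Data ParseIntoDataClass(string eventInfo)
{
    var pairs = eventInfo.Split(',');
    var keyAndValueDic = new Dictionary<string, string>();
    foreach (var item in pairs)
    {
        var keyAndValue = item.Split('=');
        // keep only the last word before '=', so the leading
        // "event.tag.report " prefix is not part of the first key
        var key = keyAndValue[0].Trim().Split(' ')[^1];
        var value = keyAndValue[1].Trim();
        keyAndValueDic.Add(key, value);
    }

    Data data = new Data();
    // map the dictionary values onto the class
    ParseDicIntoDataClass(keyAndValueDic, data);
    return data;
}

// match on name and do any parsing of types if required
private Data ParseDicIntoDataClass(Dictionary<string, string> dic, Data data)
{
    foreach (var item in dic)
    {
        switch (item.Key)
        {
            case "event":     data.Event     = item.Value; break;
            case "tag_id":    data.TagId     = item.Value; break;
            case "type":      data.Type      = item.Value; break;
            case "frequency": data.Frequency = item.Value; break;
            case "rssi":      data.Rssi      = item.Value; break;
            case "tx_power":  data.TxPower   = item.Value; break;
            case "tid":       data.Tid       = item.Value; break;
        }
    }
    return data;
}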
Seabizkit
0

The delay is caused by the way the string is parsed, not by assigning the properties. Strings are immutable, so each string operation creates new temporary strings that need to be allocated and eventually garbage-collected.

Improve parsing performance

One way to improve this is to use a regular expression to capture the key/value pairs without splitting the string. The expression (?<key>\w+)\s*=\s*(?<value>[^,\s]+) will capture key/value pairs in named groups. \w+ matches the key, and [^,\s]+ matches the value up to the next comma or whitespace, so a leading minus sign as in rssi=-451 is included.

var regex=new Regex(@"(?<key>\w+)\s*=\s*(?<value>[^,\s]+)");

IEnumerable<string> lines=File.ReadLines(path);

foreach(var line in lines)
{
    // MatchCollection's GetEnumerator() is non-generic, so name the element type explicitly
    foreach(Match match in regex.Matches(line))
    {
        var key=match.Groups["key"].Value;
        var value=match.Groups["value"].Value;
        ...
    }
}

or

foreach(var line in lines)
{
    Dictionary<string,string> dict=regex.Matches(line)
                                        .ToDictionary(m=>m.Groups["key"].Value,
                                                      m=>m.Groups["value"].Value);
    ...
}

Regex doesn't split the string. Essentially, it finds the indexes in the source string where each match and group starts and ends. No strings are created until Value is called.
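
To make the point about indexes concrete, here is a small sketch that reads only `Group.Index` and `Group.Length` and never touches `Value`. The `RegexPositions` class and its `Locate` method are hypothetical names, and the pattern is a variant whose value group accepts anything up to the next comma or whitespace, which is an assumption about the input format.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexPositions
{
    // Assumed pattern: key is a run of word characters, value runs up to the next comma or whitespace.
    private static readonly Regex PairRegex =
        new Regex(@"(?<key>\w+)\s*=\s*(?<value>[^,\s]+)");

    // Returns where each key and value sits in the line, without creating substrings.
    public static List<(int KeyIndex, int KeyLength, int ValueIndex, int ValueLength)> Locate(string line)
    {
        var positions = new List<(int KeyIndex, int KeyLength, int ValueIndex, int ValueLength)>();

        foreach (Match match in PairRegex.Matches(line))
        {
            Group key = match.Groups["key"];
            Group value = match.Groups["value"];
            positions.Add((key.Index, key.Length, value.Index, value.Length));
        }

        return positions;
    }
}
```

For the input `a=1, bb=-22` this yields `(0, 1, 2, 1)` and `(5, 2, 8, 3)`. The `Match` and `Group` objects themselves are still allocated, so this shows where the substring allocations are avoided rather than claiming the approach is allocation-free.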

You could even turn this code into a LINQ query:

var recordList=lines.Select(l=>regex.Matches(l))
                    .Select(matches=>matches.ToDictionary(m=>m.Groups["key"].Value,
                                                          m=>m.Groups["value"].Value))
                    .ToList();

You can even filter using Where before calling ToList.
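
As a sketch of such a filter, the query below keeps only the records whose `type` is `ISOC`. The `RecordFilter` class, the `ParseIsoc` method, and the choice of filter are assumptions for illustration, and the pattern is a variant whose value group runs up to the next comma or whitespace. `Cast<Match>()` is used so the query also compiles on runtimes where `MatchCollection` does not implement `IEnumerable<Match>`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RecordFilter
{
    // Assumed pattern: key is a run of word characters, value runs up to the next comma or whitespace.
    private static readonly Regex PairRegex =
        new Regex(@"(?<key>\w+)\s*=\s*(?<value>[^,\s]+)");

    public static List<Dictionary<string, string>> ParseIsoc(IEnumerable<string> lines)
    {
        return lines
            .Select(line => PairRegex.Matches(line)
                                     .Cast<Match>()
                                     .ToDictionary(m => m.Groups["key"].Value,
                                                   m => m.Groups["value"].Value))
            // Filter before materializing the list; lines without a "type" key are dropped too.
            .Where(dict => dict.TryGetValue("type", out var type) && type == "ISOC")
            .ToList();
    }
}
```

Note that `ToDictionary` throws if a line repeats a key, so this sketch assumes each key appears at most once per line.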

Create the final objects

If all tags are required, creating the data object is easy once a dictionary is available:

var records=lines.Select(l=>regex.Matches(l))
                 .Select(matches=>matches.ToDictionary(m=>m.Groups["key"].Value,
                                                       m=>m.Groups["value"].Value))
                 .Select(dict=>new Data{
                     TagId     = dict["tag_id"],
                     Type      = dict["type"],
                     Frequency = dict["frequency"],
                     ...
                 })
                 .ToList();
Panagiotis Kanavos
  • string.Split is generally faster than compiled Regex.Matches for splitting a string into substrings based on a delimiter. string.Split is optimized for splitting a string. It uses a simple loop that searches for the delimiter character(s) and creates a new substring for each section of the original string between the delimiters. This means that it can be very efficient for simple delimiters. Regex matching process involves comparing the input string to the regular expression pattern, which can require a significant amount of computation, especially if the input string is long. – Vadim Martynov Mar 02 '23 at 11:20
  • @VadimMartynov no it's not. What you described is several times slower than a regex. Far worse, splitting generates a lot of temporary strings that need to be allocated and garbage collected. Try replacing splitting with regular expressions when parsing a large file and see the difference. 5-10x less RAM and higher throughput are quite common. – Panagiotis Kanavos Mar 02 '23 at 11:29
  • @VadimMartynov regular expressions don't work the way you describe either. They *compile* the expression and compare the stream of characters against the compiled expression. They generate indexes to matches, not substrings, so no temp strings are generated until you actually try to do something with the matched data – Panagiotis Kanavos Mar 02 '23 at 11:34
  • @VadimMartynov the question's format isn't simple separators anyway. I once had to parse similar log files with 200K entries. String splitting quickly ate up 2GB of RAM and took a long time to finish. Using regular expressions never reached 200MB and ran 10x faster. You can improve performance even further if you create a parser just for this format, eg using parser combinators, or a hand-rolled parser that iterates and inspects the characters. – Panagiotis Kanavos Mar 02 '23 at 11:38
  • @VadimMartynov string splitting in this case would result in at least 2x the RAM because it would allocate almost 1x the RAM per line for the pairs, then another 1x for the key/values. The keys and values are what we want so they don't count. – Panagiotis Kanavos Mar 02 '23 at 11:41
  • @PanagiotisKanavos ok you're right. But I added a benchmark using BenchmarkDotNet to my answer and it turns out that the code with regular expressions is 10 times slower than the code from the original question and 40 times slower than the code using Span. So I think regular expressions are not the best solution in this particular problem. – Vadim Martynov Mar 02 '23 at 12:08
  • @PanagiotisKanavos you can also notice that the regular expression version consumes a lot more memory (at least for a single line). Here is code of https://replit.com/@VadimMartynov/SplitStringBenchmark?v=1 – Vadim Martynov Mar 02 '23 at 12:22
  • That's not string splitting at all. That's the raw `inspect individual characters in place` part. Yes, if you remove temporary strings you get faster parsing and less RAM usage. And if you use a parser specifically for this format you'll get even faster parsing and less usage because eg you won't have to go back and forth or trim – Panagiotis Kanavos Mar 02 '23 at 12:27
  • @PanagiotisKanavos yes a parser specifically for this format is a good solution and regular expressions are still a very bad solution for such a problem – Vadim Martynov Mar 02 '23 at 13:59