Extract keywords from strings and assign them to properties in C#

Question

I need to transfer data from MATLAB to a C# software. The data from MATLAB should also be editable offline (i.e. outside MATLAB and the C# software). To achieve this, my MATLAB code prints the data with readable patterns to a text file. For example:

<L> pt: [0.001,2,3], spd: 100, cfg: fut, daq: on, id: [1,1] </L>
<L> pt: [0.002,3,4], cfg: nut, spd: 100, id: [1,1], daq: on</L>
<C> pt: [0.02,5,3], spd: 100, daq: on, id: [1,1] </C>
<L> pt: [1.002,3,4], spd: 100, daq: off</L>

In C#, I want to parse each line, extract these keywords and assign them to properties:

enum PathType { L, C}
class Path
{
    public PathType Type { get; set; }
    public float[] Pt { get; set; }
    public int Spd { get; set; }
    public string Cfg { get; set; }
    public bool Daq { get; set; }
    public int[] Id { get; set; }
}

So for line 1, I want to end up with something like this:

var path = new Path {
    Type = PathType.L,
    Pt = new[] { 0.001f, 2f, 3f },
    Spd = 100,
    Cfg = "fut",
    Daq = true,
    Id = new[] { 1, 1 } };

And for line 4:

var path = new Path {
    Type = PathType.L,
    Pt = new[] { 1.002f, 3f, 4f },
    Spd = 100,
    Cfg = null,
    Daq = false,
    Id = null };

Since the keywords appear in a different order on each line and may be missing from some lines, I can't use a single regular expression to extract this information. Instead, I test each line against several regular expressions:

    var typeReg = new Regex(@"<(\w+)>");
    var ptReg = new Regex(@"pt:\s+(?<open>\[)[^\[]*(?<close-open>\])(?(open))");
    var spdReg = new Regex(@"spd:\s+(\d+)");
    var cfgReg = new Regex(@"cfg:\s+(fut|nut)");
    var daqReg = new Regex(@"daq:\s+(on|off)");
    var idReg = new Regex(@"id:\s+(?<open>\[)[^\[]*(?<close-open>\])(?(open))");
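As a rough sketch of how such per-keyword patterns could fill a Path object: the code below uses simplified patterns (a plain `\[([^\]]*)\]` capture instead of the balancing groups above, and `\s*` instead of `\s+`), which is an assumption that the brackets never nest. The class name RegexPathParser, the Find helper, and the choice to default a missing spd to 0 are hypothetical and not part of the original code.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public enum PathType { L, C }

public class Path
{
    public PathType Type { get; set; }
    public float[] Pt { get; set; }
    public int Spd { get; set; }
    public string Cfg { get; set; }
    public bool Daq { get; set; }
    public int[] Id { get; set; }
}

// Hypothetical helper class; the patterns are simplified variants of the ones in the question.
public static class RegexPathParser
{
    static readonly Regex TypeReg = new Regex(@"<(\w+)>");
    static readonly Regex PtReg = new Regex(@"\bpt:\s*\[([^\]]*)\]");
    static readonly Regex SpdReg = new Regex(@"\bspd:\s*(\d+)");
    static readonly Regex CfgReg = new Regex(@"\bcfg:\s*(\w+)");
    static readonly Regex DaqReg = new Regex(@"\bdaq:\s*(on|off)");
    static readonly Regex IdReg = new Regex(@"\bid:\s*\[([^\]]*)\]");

    // Returns the first capture group, or null when the keyword is absent from the line.
    static string Find(Regex regex, string line)
    {
        Match m = regex.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static Path Parse(string line)
    {
        string type = Find(TypeReg, line);
        if (type == null) throw new FormatException("No opening tag found: " + line);

        string pt = Find(PtReg, line);
        string spd = Find(SpdReg, line);
        string id = Find(IdReg, line);

        return new Path
        {
            Type = (PathType)Enum.Parse(typeof(PathType), type),
            // InvariantCulture keeps "0.001" parsing the same on machines with a comma decimal separator.
            Pt = pt == null ? null
                : pt.Split(',').Select(s => float.Parse(s, CultureInfo.InvariantCulture)).ToArray(),
            // Assumption: a missing spd falls back to 0.
            Spd = spd == null ? 0 : int.Parse(spd, CultureInfo.InvariantCulture),
            Cfg = Find(CfgReg, line),          // null when cfg is missing
            Daq = Find(DaqReg, line) == "on",  // false when daq is "off" or missing
            Id = id == null ? null
                : id.Split(',').Select(s => int.Parse(s, CultureInfo.InvariantCulture)).ToArray()
        };
    }
}
```

Because each keyword is looked up independently, the order of the keywords in a line does not matter, and a missing keyword simply leaves the corresponding property at its default.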

This works but I'm wondering if there is any better way of doing this?

Shall I print the data in a different pattern such as:

L; pt: [0.001,2,3]; spd: 100; cfg: fut; daq: on; id: [1,1]

Then I can split the string on the delimiter ; and check each substring with x.StartsWith("..."). But I feel this is not as readable as the current pattern.
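A minimal sketch of that split-based idea, assuming the semicolon-delimited format shown above. Instead of a StartsWith check per keyword, it cuts each field at its first colon, which is a small variation on the approach described. The class name DelimitedLineParser and the dictionary key "type" for the leading field are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for lines such as "L; pt: [0.001,2,3]; spd: 100; daq: on".
public static class DelimitedLineParser
{
    public static Dictionary<string, string> Parse(string line)
    {
        var result = new Dictionary<string, string>();

        // Splitting on ';' is safe here because the arrays only contain ','.
        string[] fields = line.Split(';');

        // The first field carries no keyword, so it is stored under "type".
        result["type"] = fields[0].Trim();

        for (int i = 1; i < fields.Length; i++)
        {
            string field = fields[i].Trim();
            if (field.Length == 0) continue; // tolerate a trailing ';'

            int colon = field.IndexOf(':');
            if (colon < 0) throw new FormatException("Missing ':' in field: " + field);

            string key = field.Substring(0, colon).Trim();
            string value = field.Substring(colon + 1).Trim();
            result[key] = value;
        }
        return result;
    }
}
```

The values are still raw strings such as "[0.001,2,3]" and "on", so they need a second conversion step before they can be assigned to typed properties.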

I do not want to use XML since it will make the text file bigger than the desired size.

Is MATLAB creating those strings, or do you create them yourself? If you are generating the strings that you're parsing, then use JSON instead of XML if you're concerned about the file size. If MATLAB is generating them like that, then you need to parse the split string like you said. — krillgar, Dec 04 '17 at 19:17
How big are the data files in their current form? In terms of size and number of objects. — Lasse V. Karlsen, Dec 04 '17 at 19:19
@krillgar I wrote the MATLAB code that creates the strings. I can always change them to fit my C# code better. JSON sounds good. I will give it a try. — Anthony, Dec 04 '17 at 19:29
@LasseVågsætherKarlsen a few KB at the moment. I don't want them to grow into a few MB :D — Anthony, Dec 04 '17 at 19:29

Anthony · Accepted Answer · 2017-12-11T14:07:58.413

I compared JSON and XML and found that XML is better. I had a poor understanding of the XML format and assumed it would significantly increase the file size, which turned out to be wrong.

The JSON implementation

This method requires a NuGet package, Newtonsoft.Json. There are also other solutions that do not require it.

MATLAB output:

{"type":"L", "pt": "[0.001,2,3]", "spd": "100", "cfg": "fut", "daq": "on", "id": "[1,1]"}

C# code for decoding JSON is as simple as:

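using System.Collections.Generic;
using Newtonsoft.Json;

public Dictionary<string, string> DecodeJson(string script)
{
    // Every value in the MATLAB output is a JSON string, so a string-to-string dictionary is enough.
    return JsonConvert.DeserializeObject<Dictionary<string, string>>(script);
}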

The returned object is a dictionary with the attribute names (type, pt, spd, etc.) as keys and the attribute values as values.
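Since every value in the dictionary is still a string, a conversion step is needed before the values fit the typed properties from the question. The following is a sketch of such converters, not part of the original answer. The class name ValueConverters and all of its method names are hypothetical, and it assumes arrays are written as "[a,b,c]" without nesting.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical converters for the string values in the decoded dictionary.
public static class ValueConverters
{
    // "[0.001,2,3]" -> { 0.001f, 2f, 3f }
    public static float[] ToFloatArray(string text)
    {
        return SplitBracketed(text)
            .Select(s => float.Parse(s, CultureInfo.InvariantCulture))
            .ToArray();
    }

    // "[1,1]" -> { 1, 1 }
    public static int[] ToIntArray(string text)
    {
        return SplitBracketed(text)
            .Select(s => int.Parse(s, CultureInfo.InvariantCulture))
            .ToArray();
    }

    // "on" -> true, "off" -> false, anything else is rejected.
    public static bool ToBool(string text)
    {
        switch (text)
        {
            case "on": return true;
            case "off": return false;
            default: throw new FormatException("Expected on/off but got: " + text);
        }
    }

    // Returns null for keys that are absent, e.g. cfg or id on some lines.
    public static string GetOrNull(Dictionary<string, string> values, string key)
    {
        string value;
        return values.TryGetValue(key, out value) ? value : null;
    }

    // Strips the surrounding brackets and splits the rest on commas.
    static string[] SplitBracketed(string text)
    {
        string inner = text.Trim().TrimStart('[').TrimEnd(']');
        return inner.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With these in place, an assignment could look like `path.Pt = ValueConverters.ToFloatArray(dict["pt"])`, while optional keys would go through GetOrNull first. Parsing with the invariant culture avoids surprises on machines whose locale uses a comma as the decimal separator.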

Decoding 100k of the example output requires roughly 330 ms.

The XML implementation

MATLAB output:

<L pt='[0.001,2,3]' spd='100' cfg='fut' daq='on' id='[1,1]'  />

C# code:

using System.Collections.Generic;
using System.IO;
using System.Xml;

public Dictionary<string, string> DecodeXml(string script)
{
    var xmlObj = new Dictionary<string, string>();
    using (var reader = XmlReader.Create(new StringReader(script)))
    {
        // move to the element node
        reader.Read();
        // add node name, i.e. xmlObj["type"] -> L
        xmlObj.Add("type", reader.Name);
        // add all attributes
        while (reader.MoveToNextAttribute())
        {
            xmlObj.Add(reader.Name, reader.Value);
        }
    }
    return xmlObj;
}

This method returns the same object as the JSON method.

Decoding 100k of the example output requires roughly 180 ms.
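The answer does not show how the timings were taken. One way such a measurement could be reproduced is with System.Diagnostics.Stopwatch, as in the sketch below. The class name DecodeTimer and its Measure method are hypothetical, and the warm-up call is an assumption about what should be excluded from the measurement.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper; works with any decoder that takes a string.
public static class DecodeTimer
{
    // Calls decode(input) 'iterations' times and returns the elapsed wall-clock time.
    public static TimeSpan Measure<T>(Func<string, T> decode, string input, int iterations)
    {
        // One untimed call first, so JIT compilation is not counted.
        decode(input);

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            decode(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

A call such as `DecodeTimer.Measure(s => DecodeXml(s), line, 100000)` would then time 100k decodes of one example line. The absolute numbers depend on the machine and runtime, so only the ratio between the two decoders is meaningful.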

Conclusion

The input string for the XML implementation is shorter than the one for the JSON implementation, so the file is smaller. The XML code also runs almost 2x faster (roughly 180 ms versus 330 ms in the test above). Therefore, XML is the better choice here.
