2

I would like to write C# code that parses nested parenthesis to array elements, but only on first level. An example is needed for sure:

I want this string:

"(example (to (parsing nested paren) but) (first lvl only))"

tp be parsed into:

["example", "(to (parsing nested paren) but)", "(first lvl only)"]

I was thinking about using regex but can't figure out how to properly use them without implementing this behaviour from scratch.

In the case of malformed inputs I would like to return an empty array, or an array ["error"]

chendoy
  • 155
  • 1
  • 2
  • 12
  • What about malformed strings? What would `hello (world` output? – Sweeper May 31 '20 at 13:45
  • 2
    Go through the characters. Increment a counter each time you encounter a `(`. Decrement the counter each time you encounter a `)`. When the counter changes from 0 to 1, start a new element. When the counter goes from 1 to 0, finish the current element. If the counter is equal to one and the current character is a space, start a new element. In all other cases add the current character to the current element. If in the end the counter is not equal 0, return an error. – GSerg May 31 '20 at 13:58
  • yea this sounds reasonable but takes time to debug. Is there a way using Regex maybe? – chendoy May 31 '20 at 16:06
  • according to your desired output, if we apply a regex on your input string, it should return something like this not what you wrote there as your output `["(example)", "(to but)", "(parsing nested parent)", "(first lvl only)"]`. I think you should rethink your desired solution again – Formula12 May 31 '20 at 16:19
  • @user112112 I can post a regex code if you want? (i did previously posted but it was downvoted maybe because you did not want regex so I deleted it) – Ali Kleit May 31 '20 at 16:24
  • `but takes time to debug` - yes, that would be "doing your bit". Otherwise see https://stackoverflow.com/q/14952113/11683 and https://stackoverflow.com/q/25239065/11683. – GSerg May 31 '20 at 17:16
  • Does this answer your question? [How can I match nested brackets using regex?](https://stackoverflow.com/questions/14952113/how-can-i-match-nested-brackets-using-regex) – Dour High Arch May 31 '20 at 17:19

3 Answers3

1

well, regex will do the job:

var text = @"(example (to (parsing nested paren) but) (first lvl only))";
var pattern = @"\(([\w\s]+) (\([\w\s]+ \([\w\s]+\) [\w\s]+\)) (\([\w\s]+\))\)*";

try
{
    Regex r = new Regex(pattern, RegexOptions.IgnoreCase);      

    Match m = r.Match(text);

    string group_1 = m.Groups[1].Value; //example
    string group_2 = m.Groups[2].Value; //(to (parsing nested paren) but)
    string group_3 = m.Groups[3].Value; //(first lvl only)

    return new string[]{group_1,group_2,group_3};

}
catch(Exception ex){

    return new string[]{"error"};
}

hopefully this helps, tested here in dotnetfiddle

Edit:

this might get you started into building the right expression according to whatever patterns you are falling into and maybe build a recursive function to parse the rest into the desired output :)

Ali Kleit
  • 3,069
  • 2
  • 23
  • 39
  • This only parses the specific string `"(example (to (parsing nested paren) but) (first lvl only))"` and nothing else. It is baffling that this is upvoted. – GSerg May 31 '20 at 17:11
  • @GSerg you can introduce a `level` parameter to recursively use that function, hope that is useful :) – Ali Kleit May 31 '20 at 17:12
  • What's the point of having a function if I need to manually create a new pattern for each new piece of data? – GSerg May 31 '20 at 17:15
  • well, the function can be changed to the desired output (maybe return an empty string in case of no match and recursively parse the rest), but according to his question, he needed that result. and to **get the first level only**. plus i think that would help for the next steps (if any)! – Ali Kleit May 31 '20 at 17:20
  • and regarding the **pattern**, it's not harmful to **change the expression dynamically according to specific criteria** (maybe level nb) once known how to make use of regex in that problem – Ali Kleit May 31 '20 at 17:34
  • What will your function return for the input of `"(example example2 (example 3))"`? – GSerg May 31 '20 at 18:52
  • Like I said, "that expression would help for the next steps" as in "help in building the right function to test on different patterns". – Ali Kleit May 31 '20 at 19:13
1

RegEx is not recursive. You either count bracket level, or recurse.

An non-recursive parser loop I tested for the example you show is..

    string SplitFirstLevel(string s)
    {
        List<string> result = new List<string>();
        int p = 0, level = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == '(') 
            { 
                level++;
                if (level == 1) p = i + 1;
                if (level == 2)
                {
                    result.Add('"' + s.Substring(p, i - p) + '"');
                    p = i; 
                }                   
            }
            if (s[i] == ')')
                if (--level == 0)
                  result.Add('"' + s.Substring(p, i - p) + '"');
        }
        return "[" + String.Join(",", result) + "]";
    }

Note: after some more testing, I see your specification is unclear. How to delimit orphaned level 1 terms, that is terms without bracketing ?

For example, my parser translates

   (example (to (parsing nested paren) but) (first lvl only))
   to:
   ["example ","(to (parsing nested paren) but) ","(first lvl only)"]

and

   (example (to (parsing nested paren)) but (first lvl only))
   to:
   ["example ","(to (parsing nested paren)) but ","(first lvl only)"]

In either case, "example" gets a separate term, while "but" is grouped with the first term. In the first example this is logical, it is in the bracketing, but it may be unwanted behaviour in the second case, where "but" should be separated, like "example", which also has no bracketing (?)

Goodies
  • 1,951
  • 21
  • 26
1

I developed a parser for your example. I also checked some other examples which you can see in the code.

   using System;
using System.Collections;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        string str = "(example (to (parsing nested paren) but) (first lvl only))"; // => [example  , (to (parsing nested paren) but) ,  (first lvl only)]
        //string str = "(first)(second)(third)"; // => [first , second , third]
        //string str = "(first(second)third)"; // => [first , (second) , third]
        //string str = "(first(second)(third)fourth)"; // => [first , (second) , (third) ,  fourth]
        //string str = "(first((second)(third))fourth)"; // => [first , ((second)(third)) , fourth]

        //string str = "just Text"; // => [ERROR]   
        //string str = "start with Text (first , second)"; // => [ERROR]
        //string str = "(first , second) end with text"; // => [ERROR]
        //string str = ""; // => [ERROR]    
        //string str = "("; // => [ERROR]
        //string str = "(first()(second)(third))fourth)"; // => [ERROR]
        //string str = "(((extra close pareanthese))))"; // => [ERROR]  
        var res = Parser.parse(str);
        showRes(res);
    }

    static void showRes(ArrayList res)
    {
        var strings = res.ToArray();
        var theString = string.Join(" , ", strings);
        Console.WriteLine("[" + theString + "]");
    }
}

public class Parser
{
    static Dictionary<TokenType, TokenType> getRules()
    {
        var rules = new Dictionary<TokenType, TokenType>();
        rules.Add(TokenType.OPEN_PARENTHESE, TokenType.START | TokenType.OPEN_PARENTHESE | TokenType.CLOSE_PARENTHESE | TokenType.SIMPLE_TEXT);
        rules.Add(TokenType.CLOSE_PARENTHESE, TokenType.SIMPLE_TEXT | TokenType.CLOSE_PARENTHESE);
        rules.Add(TokenType.SIMPLE_TEXT, TokenType.SIMPLE_TEXT | TokenType.CLOSE_PARENTHESE | TokenType.OPEN_PARENTHESE);
        rules.Add(TokenType.END, TokenType.CLOSE_PARENTHESE);
        return rules;
    }

    static bool isValid(Token prev, Token cur)
    {
        var rules = Parser.getRules();
        return rules.ContainsKey(cur.type) && ((prev.type & rules[cur.type]) == prev.type);
    }

    public static ArrayList parse(string sourceText)
    {
        ArrayList result = new ArrayList();
        int openParenthesesCount = 0;
        Lexer lexer = new Lexer(sourceText);
        Token prevToken = lexer.getStartToken();
        Token currentToken = lexer.readNextToken();
        string tmpText = "";
        while (currentToken.type != TokenType.END)
        {
            if (currentToken.type == TokenType.OPEN_PARENTHESE)
            {
                openParenthesesCount++;
                if (openParenthesesCount > 1)
                {
                    tmpText += currentToken.token;
                }
            }
            else if (currentToken.type == TokenType.CLOSE_PARENTHESE)
            {
                openParenthesesCount--;
                if (openParenthesesCount < 0)
                {
                    return Parser.Error();
                }

                if (openParenthesesCount > 0)
                {
                    tmpText += currentToken.token;
                }
            }
            else if (currentToken.type == TokenType.SIMPLE_TEXT)
            {
                tmpText += currentToken.token;
            }

            if (!Parser.isValid(prevToken, currentToken))
            {
                return Parser.Error();
            }

            if (openParenthesesCount == 1 && tmpText.Trim() != "")
            {
                result.Add(tmpText);
                tmpText = "";
            }

            prevToken = currentToken;
            currentToken = lexer.readNextToken();
        }

        if (openParenthesesCount != 0)
        {
            return Parser.Error();
        }

        if (!Parser.isValid(prevToken, currentToken))
        {
            return Parser.Error();
        }

        if (tmpText.Trim() != "")
        {
            result.Add(tmpText);
        }

        return result;
    }

    static ArrayList Error()
    {
        var er = new ArrayList();
        er.Add("ERROR");
        return er;
    }
}

class Lexer
{
    string _txt;
    int _index;
    public Lexer(string text)
    {
        this._index = 0;
        this._txt = text;
    }

    public Token getStartToken()
    {
        return new Token(-1, TokenType.START, "");
    }

    public Token readNextToken()
    {
        if (this._index >= this._txt.Length)
        {
            return new Token(-1, TokenType.END, "");
        }

        Token t = null;
        string txt = "";
        if (this._txt[this._index] == '(')
        {
            txt = "(";
            t = new Token(this._index, TokenType.OPEN_PARENTHESE, txt);
        }
        else if (this._txt[this._index] == ')')
        {
            txt = ")";
            t = new Token(this._index, TokenType.CLOSE_PARENTHESE, txt);
        }
        else
        {
            txt = this._readText();
            t = new Token(this._index, TokenType.SIMPLE_TEXT, txt);
        }

        this._index += txt.Length;
        return t;
    }

    private string _readText()
    {
        string txt = "";
        int i = this._index;
        while (i < this._txt.Length && this._txt[i] != '(' && this._txt[i] != ')')
        {
            txt = txt + this._txt[i];
            i++;
        }

        return txt;
    }
}

class Token
{
    public int position
    {
        get;
        private set;
    }

    public TokenType type
    {
        get;
        private set;
    }

    public string token
    {
        get;
        private set;
    }

    public Token(int position, TokenType type, string token)
    {
        this.position = position;
        this.type = type;
        this.token = token;
    }
}

[Flags]
enum TokenType
{
    START = 1,
    OPEN_PARENTHESE = 2,
    SIMPLE_TEXT = 4,
    CLOSE_PARENTHESE = 8,
    END = 16
}
Ehsan Nazeri
  • 771
  • 1
  • 6
  • 10