
I am looking to create an IRC-like command format:

/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5

Which would (ideally) give me a list of parameters:

parameter1
parameter 2
parameter "3"
parameter"4
parameter\"5

From what I have read, this isn't at all trivial with a regex and might be better done some other way.

Thoughts?

Below is C# code that does the job I need:

public List<string> ParseIrcCommand(string command)
    {
        command = command.Trim();
        command = command.TrimStart(new char[] { '/' });
        command += ' ';

        List<string> Tokens = new List<string>();

        int tokenStart = 0;
        bool inQuotes = false;
        bool inToken = true;
        string currentToken = "";
        for (int i = tokenStart; i < command.Length; i++)
        {
            char currentChar = command[i];
            char nextChar = (i + 1 >= command.Length ? ' ' : command[i + 1]);

            if (!inQuotes && inToken && currentChar == ' ')
            {
                Tokens.Add(currentToken);
                currentToken = "";
                inToken = false;
                continue;
            }

            if (inQuotes && inToken && currentChar == '"')
            {
                Tokens.Add(currentToken);
                currentToken = "";
                inQuotes = false;
                inToken = false;
                if (nextChar == ' ') i++;
                continue;
            }

            if (inQuotes && inToken && currentChar == '\\' && nextChar == '"')
            {
                i++;
                currentToken += nextChar;
                continue;
            }

            if (!inToken && currentChar != ' ')
            {
                inToken = true;
                tokenStart = i;
                if (currentChar == '"')
                {
                    tokenStart++;
                    inQuotes = true;
                    continue;
                }
            }

            currentToken += currentChar;
        }

        return Tokens;
    }
Steffan Donal
  • I'm not exactly great with regex, but what I have so far is barely functional: `^/\w+( ([^ ]+))*` – Steffan Donal Feb 06 '13 at 10:47
  • @Bergi I've looked at other StackOverflow questions about using quotes to ignore separators and they talk about back references and some other stuff that makes my head hurt :P – Steffan Donal Feb 06 '13 at 10:59
  • @Ruirize: Do you know how to parse the command if you were to write a normal program rather than a regex? (You don't need to write one, but you should know in detail how to do so.) If you do, then a regex solution is probably possible. You need to define the grammar for the command - it will help greatly in writing a regex. – nhahtdh Feb 06 '13 at 11:53
  • I will see if I can get a C# version working (non regex). I will comment & update the question when done. – Steffan Donal Feb 06 '13 at 12:10
  • I have added the C# method as requested @nhahtdh – Steffan Donal Feb 06 '13 at 16:05
  • @Ruirize: So you parse the command the same way as the parameter? – nhahtdh Feb 06 '13 at 16:36

2 Answers


You have shown your code - that's good, but it seems that you haven't thought about whether it is reasonable to parse the command like that:

  • Firstly, your code allows newline characters inside the command name and parameters. It would be reasonable to assume that a newline can never appear there.
  • Secondly, \ also needs to be escapable, just like ", since otherwise there is no unambiguous way to end a quoted parameter with a single \.
  • Thirdly, it is a bit odd to parse the command name the same way as the parameters - command names are usually predetermined and fixed, so there is no need to allow flexible ways to specify them.
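
To illustrate the second point, here is a minimal sketch. The helper `readQuoted` is hypothetical (it is not part of the question's code); it reads a single quoted token and takes the set of characters that a backslash may escape, so the two rule sets can be compared side by side.

```javascript
// Hypothetical helper for illustration only.
// Reads one quoted token from str, which must start with a double quote.
// `escapable` is the set of characters a backslash is allowed to escape.
// Returns the token text, or null if no closing quote is found.
function readQuoted(str, escapable) {
    var out = "";
    for (var i = 1; i < str.length; i++) {
        var c = str[i];
        if (c === "\\" && i + 1 < str.length &&
                escapable.indexOf(str[i + 1]) !== -1) {
            out += str[i + 1]; // keep the escaped character, drop the backslash
            i++;
        } else if (c === '"') {
            return out;        // closing quote
        } else {
            out += c;
        }
    }
    return null;               // ran out of input inside the quotes
}

// Typed input: "dir\"   (only \" is an escape)
// The final quote is read as an escaped quote, so the token never closes.
var onlyQuote = readQuoted('"dir\\"', '"');          // null

// Typed input: "dir\\"  (both \" and \\ are escapes)
// The token ends cleanly with a single backslash.
var withBackslash = readQuoted('"dir\\\\"', '"\\');  // 'dir\\' i.e. dir\
```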

I cannot think of a general one-line solution in JavaScript. JavaScript regex lacks \G, which asserts the position where the previous match ended. So my solution has to make do with the beginning-of-string assertion ^ and chomp off the string as each token is matched.

(There is not much code here, mostly comments)

function parseCommand(str) {
    /*
     * Trim() in C# will trim off all whitespace characters
     * \s in JavaScript regex also matches any whitespace character
     * However, the set of characters considered as whitespace might not be
     * equivalent
     * But you can be sure that \r, \n, \t, space (ASCII 32) are included.
     * 
     * However, allowing all those whitespace characters in the command
     * is questionable.
     */
    str = str.replace(/^\s*\//, "");

    /* Look-ahead (?!") is needed to prevent matching a quoted parameter with
     * a missing closing quote as a non-quoted token
     * The look-ahead is needed because your code does not backtrack
     * while the regex engine will backtrack. A possessive quantifier can
     * prevent backtracking, but it is not supported by JavaScript RegExp.
     *
     * We emulate the effect of \G by using ^ and repeatedly chomping off
     * the string.
     *
     * The regex will match 2 cases:
     * (?!")([^ ]+)
     * This will match non-quoted tokens, which are not allowed to 
     * contain spaces
     * The token is captured into capturing group 1
     *
     * "((?:[^\\"]|\\[\\"])*)"
     * This will match quoted tokens, which consist of 0 or more of:
     * non-quote-or-backslash [^\\"] OR escaped quote \"
     * OR escaped backslash \\
     * The text inside the quote is captured into capturing group 2
     */
    var regex = /^ *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/;
    var tokens = [];
    var arr;

    while ((arr = str.match(regex)) !== null) {
        if (arr[1] !== void 0) {
            // Non-space token
            tokens.push(arr[1]);
        } else {
            // Quoted token, needs extra processing to
            // convert escaped character back
            tokens.push(arr[2].replace(/\\([\\"])/g, '$1'));
        }

        // Remove the matched text
        str = str.substring(arr[0].length);
    }

    // Test that the leftover consists of only space characters
    if (/^ *$/.test(str)) {
        return tokens;
    } else {
        // The only way to reach here is an unclosed quoted token
        // Your code returns the tokens successfully parsed
        // but I think it is better to show an error here.
        return null;
    }
}
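
As a side note, engines that support ES2015 offer the sticky flag `y`, which anchors each match at `lastIndex` and so can stand in for \G. The following is only a sketch under that assumption; the name `parseCommandSticky` is hypothetical, and the token regex is the same one as above with the leading ^ dropped.

```javascript
// Sketch assuming an ES2015+ engine (sticky flag "y").
// Same token grammar as parseCommand above, but the string is not chomped:
// the sticky regex resumes exactly where the previous match ended.
function parseCommandSticky(str) {
    // Skip leading whitespace and the slash, if present
    var lead = /^\s*\//.exec(str);
    var pos = lead ? lead[0].length : 0;

    var token = / *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/y;
    var tokens = [];
    var m;

    token.lastIndex = pos;
    while ((m = token.exec(str)) !== null) {
        if (m[1] !== undefined) {
            tokens.push(m[1]);                              // non-quoted token
        } else {
            tokens.push(m[2].replace(/\\([\\"])/g, "$1"));  // unescape \" and \\
        }
        // exec() resets lastIndex to 0 when it fails, so remember it here
        pos = token.lastIndex;
    }

    // Anything other than trailing spaces means an unclosed quoted token
    return /^ *$/.test(str.slice(pos)) ? tokens : null;
}

var parsed = parseCommandSticky(
    '/commandname parameter1 "parameter 2" "parameter \\"3\\"" parameter"4 parameter\\"5');
// parsed is:
// ['commandname', 'parameter1', 'parameter 2', 'parameter "3"',
//  'parameter"4', 'parameter\\"5']
```

Like the function above, this returns the command name as the first token and null when a quoted token is left open.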
nhahtdh
  • This is fantastic - very well explained. Thank you! – Steffan Donal Feb 07 '13 at 09:54
  • @nhahtdh: I suggest changing `\\[\\"]` to `\\.` to [prevent it from failing in bizarre cases, with other backslashes](https://regex101.com/r/jR8qY9/1) – Mariano Sep 07 '15 at 22:11
  • @Mariano: Do note that I write this regex based on the code in the question. If the language design decides to reject ``\`` followed by arbitrary character, then so be it - it's a valid decision. – nhahtdh Sep 08 '15 at 01:58
  • @nhahtdh: Fair enough. I know IRC allows it, but it's not stated in the question. – Mariano Sep 08 '15 at 02:09

I created a simple regex that matches the command line you wrote.

/\w+\s((("([^\\"]*\\")*[^\\"]*")|[^ ]+)(\b|\s+))+$
  • /\w+\s finds the first part of your command
  • (((
  • "([^\\"]*\\")* matches an opening " followed by zero or more chunks, each being a run of characters other than \ and " that ends in \" (thus allowing "something\", "some\"thing\" and so on)
  • [^\\"]*" followed by a run of characters containing neither \ nor ", and finally the closing "
  • )|[^ ]+ this is the alternative: any sequence of non-space characters
  • )
  • (\b|\s+) all followed by whitespace or a word boundary
  • )+$ one or more times, once per parameter, until the end of the string

I'm afraid that this can fail sometimes, but I posted it to show that arguments often have a structure based on repetition. For example, in "something\"something\"something\"end" the repeated unit is something\", and you can use this idea to build your regex.
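
The repetition idea can be tried in isolation. This is a minimal sketch that takes only the quoted-argument part of the regex above, anchors it with ^ and $ (an addition made for the test), and uses a non-capturing group; the variable name `quoted` is just for illustration.

```javascript
// One quoted argument: zero or more  text\"  chunks, then a final  text"  chunk.
var quoted = /^"(?:[^\\"]*\\")*[^\\"]*"$/;

quoted.test('"something\\"something\\"something\\"end"'); // true
quoted.test('"plain"');                                    // true
quoted.test('"unterminated');                              // false
// A backslash that is not followed by a quote is rejected by this pattern,
// which is one of the cases where it "can fail":
quoted.test('"back\\slash"');                              // false
```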

Gabber
  • This will only get the last group as part of the match: `'/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5'.match(/\/\w+\s((("([^\\"]*\\")*[^\\"]*")|[^ ]+)(\b|\s+))+$/)` – Explosion Pills Feb 06 '13 at 15:14
  • You are right, this was just an example. To get the single tokens I think the correct regex could be `((("([^\\"]*\\")*[^\\"]*")|[^ ]+)(\b|\s+))` repeated through the whole string, of course after getting the command string with `/\w+`. Edited btw, thanks – Gabber Feb 06 '13 at 15:31