I'm trying to parse an argument string into an array of arguments. I have it mostly working, but it definitely seems like there'd be an easier way to go about doing this.

Rules:

  • Quoted strings ("some string") should be treated as a single argument, but the quotes should be removed from the resulting string
  • Any whitespace should separate arguments, except when we're already at the argCount (allowing the final argument to be unquoted, with all non-leading/trailing whitespace included)
  • Quotes should be ignored in the final argument, being left in the string as-is, unless the quotes in question are surrounding the entire final argument.

Examples:

  • `this is an arg string` with argCount 2 should result in `['this', 'is an arg string']`
  • `"this is" an arg string` with argCount 2 should result in `['this is', 'an arg string']`
  • `"this is" "an arg" string too` with argCount 3 should result in `['this is', 'an arg', 'string too']`
  • `this\nis an arg\n string!` with argCount 3 should result in `['this', 'is', 'an arg\n string!']`
  • `this\nis an arg string!` with argCount 2 should result in `['this', 'is an arg string!']`
  • `this\nis an arg string\nwith multiple lines in the final arg.\n inner whitespace still here` with argCount 2 should result in `['this', 'is an arg string\nwith multiple lines in the final arg.\n inner whitespace still here']`
  • `this is an arg " string with "quotes in the final" argument.` with argCount 2 should result in `['this', 'is an arg " string with "quotes in the final" argument.']`
  • `"this is" "an arg string with nested "quotes" in the final arg. neat."` with argCount 2 should result in `['this is', 'an arg string with nested "quotes" in the final arg. neat.']`

My current code:

function parseArgs(argString, argCount) {
    if(argCount) {
        if(argCount < 2) throw new RangeError('argCount must be at least 2.');
        const args = [];
        const newlinesReplaced = argString.trim().replace(/\n/g, '{!~NL~!}');
        const argv = stringArgv(newlinesReplaced);
        if(argv.length > 0) {
            for(let i = 0; i < argCount - 1; i++) args.push(argv.shift());
            if(argv.length > 0) args.push(argv.join(' ').replace(/{!~NL~!}/g, '\n').replace(/\n{3,}/g, '\n\n'));
        }
        return args;
    } else {
        return stringArgv(argString);
    }
}

I'm using the string-argv library, which provides the `stringArgv` function called above. The last four examples do not work properly with my code: the dummy newline replacement tokens cause the arguments to be smashed together during the `stringArgv` call, and quotes take complete priority.

Update:

I clarified the quotes rule, and added a rule about quotes also being left untouched in the final argument. Added two additional examples to go along with the new rule.

Gawdl3y
  • What if you want to include literal quotes? – trincot Aug 20 '16 at 09:51
  • What is `stringArgv`? – Wiktor Stribiżew Aug 20 '16 at 11:02
  • @WiktorStribiżew See the very bottom of the question. – Gawdl3y Aug 20 '16 at 17:26
  • @trincot Good point, forgot to mention that. I'll update the question. – Gawdl3y Aug 20 '16 at 17:27
  • What if the result should be `['a "test"', 'rest']`, with the literal double quotes in the first element? How would the original string look to get that? – trincot Aug 20 '16 at 17:47
  • @trincot Well, in my use-case, I highly doubt double quotes would ever be necessary in any of the beginning arguments - but I suppose some form of escaping would work, like with a backslash. I feel that this may add too much complexity to the parsing, though. – Gawdl3y Aug 20 '16 at 17:57
  • Instead of listing all of your conditions, I believe you had better simply provide a single string that covers all of them, along with the expected output. – Redu Aug 20 '16 at 18:22

2 Answers

2

I haven't had the chance to test it thoroughly, but the following code probably solves your problem.

function handleQuotedString(m,sm){
  return sm.trim().indexOf(" ") === -1 ? sm : '"' + sm.trim() + '"';
}
function getArguments(s,n){
  return s.trim()                       // get rid of any preceding and trailing whitespaces
          .replace(/\n/g, " \n ")       // make word\nword => word \n word
          .replace(/"([\S\s]+?)"/g,handleQuotedString)
          .split(" ")                   // get words into array
          .reduce((p,w) => w[0] === '"' ||
                           w[0] === "'" ? (p[0] = true, p.concat(w.slice(1)))
                                        : w[w.length-1] === '"' ||
                                          w[w.length-1] === "'" ? (p[0] = false, p[p.length-1]+= " " + w.slice(0,w.length-1), p)
                                                                : p[0] ? (w !== "\n" && (p[p.length-1]+= " " + w.slice(1,w.length-1)), p)
                                                                       : p.concat(w), [false])
          .slice(1)
          .reduce((args,arg) => args[0] ? arg !== "\n" &&
                                          arg !== ""   ? (args[0]--,args.concat(arg))
                                                       : args
                                        : (args[args.length-1]+= " " + arg || " ",args),[n])
          .slice(1);
}
var s = 'hi there\nas "you" see "   this\nis  " "an arg" string\n     too';
console.log(getArguments(s,7));

The first reduce inclusively merges words starting with a quote up until it meets another word ending with a quote.
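The merging idea behind that first reduce can be sketched as a plain loop. This is a simplified illustration and not the code above: `mergeQuotedWords` and `pending` are hypothetical names, only double quotes are handled, and merged words are always rejoined with a single space.

```javascript
// Simplified sketch (hypothetical helper, not the answer's actual reduce):
// merge every run of words from one starting with '"' to one ending with '"'.
function mergeQuotedWords(words) {
  const merged = [];
  let pending = null; // words collected since an opening quote, or null
  for (const word of words) {
    if (pending === null) {
      if (word.length > 1 && word[0] === '"' && word[word.length - 1] === '"') {
        merged.push(word.slice(1, -1));   // a single quoted word
      } else if (word[0] === '"') {
        pending = [word.slice(1)];        // opening quote: start collecting
      } else {
        merged.push(word);                // ordinary word
      }
    } else if (word[word.length - 1] === '"') {
      pending.push(word.slice(0, -1));    // closing quote: flush the run
      merged.push(pending.join(' '));
      pending = null;
    } else {
      pending.push(word);                 // still inside the quotes
    }
  }
  // Unterminated quote: keep the text, including its opening quote.
  if (pending !== null) merged.push('"' + pending.join(' '));
  return merged;
}

const mergedWords = mergeQuotedWords('say "hello big world" now'.split(' '));
console.log(mergedWords); // [ 'say', 'hello big world', 'now' ]
```

Because the words come from a split on single spaces, runs of whitespace inside quotes are not preserved by this sketch.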

The second reduce sets up arguments according to the given count and other conditions.

Of course, the input string may contain many special characters that need to be eliminated. This can be done with an initial filtering stage.

Redu
  • Your solution also seems to work flawlessly, from my limited testing. I so wish I could mark two answers as accepted. I chose @trincot's though because the code is much nicer to look at due to using regular expressions rather than loads of branching string/array manipulation. Both answers also seem to perform (in terms of speed) identically. Thanks for your answer! – Gawdl3y Aug 20 '16 at 18:54
  • @Gawdl3y Thank you. trincot is a great guy not only at JS but also in sorting out algorithms. I should also plus him. – Redu Aug 20 '16 at 18:57
  • Thank you, Redu, for those kind words ;-) – trincot Aug 20 '16 at 18:59
2

You could use a regular expression for this:

function mySplit(s, argCount) {
    var re = /\s*(?:("|')([^]*?)\1|(\S+))\s*/g,
        result = [],
        match = []; // should be non-null
    argCount = argCount || s.length; // default: large enough to get all items
    // get match and push the capture group that is not null to the result
    while (--argCount && (match = re.exec(s))) result.push(match[2] || match[3]);
    // if text remains, push it to the array as it is, except for 
    // wrapping quotes, which are removed from it
    if (match && re.lastIndex < s.length)
        result.push(s.substr(re.lastIndex).replace(/^("|')([^]*)\1$/g, '$2'));
    return result;
}
// Sample input
var s = '"this is" "an arg" string too';
// Split it
var parts = mySplit(s, 3);
// Show result
console.log(parts);

This gives the desired result for all example cases you provided.
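As a quick check of that claim, the sketch below runs a copy of `mySplit` against the eight examples from the question. The `cases` table and the `failures` list are names made up for this illustration; the function body is copied unchanged from above so the snippet runs on its own.

```javascript
// Copy of mySplit from the answer, unchanged.
function mySplit(s, argCount) {
    var re = /\s*(?:("|')([^]*?)\1|(\S+))\s*/g,
        result = [],
        match = [];
    argCount = argCount || s.length;
    while (--argCount && (match = re.exec(s))) result.push(match[2] || match[3]);
    if (match && re.lastIndex < s.length)
        result.push(s.substr(re.lastIndex).replace(/^("|')([^]*)\1$/g, '$2'));
    return result;
}

// [input, argCount, expected] triples taken from the question's examples.
const cases = [
    ['this is an arg string', 2, ['this', 'is an arg string']],
    ['"this is" an arg string', 2, ['this is', 'an arg string']],
    ['"this is" "an arg" string too', 3, ['this is', 'an arg', 'string too']],
    ['this\nis an arg\n string!', 3, ['this', 'is', 'an arg\n string!']],
    ['this\nis an arg string!', 2, ['this', 'is an arg string!']],
    ['this\nis an arg string\nwith multiple lines in the final arg.\n inner whitespace still here', 2,
        ['this', 'is an arg string\nwith multiple lines in the final arg.\n inner whitespace still here']],
    ['this is an arg " string with "quotes in the final" argument.', 2,
        ['this', 'is an arg " string with "quotes in the final" argument.']],
    ['"this is" "an arg string with nested "quotes" in the final arg. neat."', 2,
        ['this is', 'an arg string with nested "quotes" in the final arg. neat.']]
];

// Collect every case whose actual output differs from the expected one.
const failures = cases.filter(([input, count, expected]) =>
    JSON.stringify(mySplit(input, count)) !== JSON.stringify(expected));

console.log(failures.length === 0 ? 'all examples pass' : failures);
```

These cases do not cover input with trailing whitespace; the final argument would keep it, so trimming the input first may be worthwhile.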

Backslash escaping

If you want to support backslash escaping, so you can embed literal quotes in your first arguments without interrupting those arguments, then you can use this regular expression in the above code:

var re = /\s*(?:("|')((?:\\[^]|[^\\])*?)\1|(\S+))\s*/g,

The magic is in (?:\\[^]|[^\\]): either a backslash followed by something, or not-a-backslash. This way, the quote that follows a backslash will never get matched as an argument-closing one.
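A small demonstration of that behaviour, using the escaping-aware pattern on its own. Note that the pattern only stops the escaped quotes from closing the argument; the backslashes stay in the captured text. The unescaping `replace` below is an extra step assumed for this illustration and is not part of the code above.

```javascript
// Escaping-aware variant of the pattern.
const re = /\s*(?:("|')((?:\\[^]|[^\\])*?)\1|(\S+))\s*/g;

// Raw input characters: "a \"test\"" rest
const input = '"a \\"test\\"" rest';

const match = re.exec(input);
const rawFirst = match[2]; // the escaped quotes did not end the argument

// Assumed extra step: drop the backslash in front of each escaped character.
const first = rawFirst.replace(/\\([^])/g, '$1');
const rest = input.slice(re.lastIndex);

console.log(rawFirst); // a \"test\"
console.log(first);    // a "test"
console.log(rest);     // rest
```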

The (?: makes the group non capturing (i.e. it is not numbered for $1 style back-references).
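To illustrate the effect on group numbering, the sketch below compares the fully capturing pattern suggested in the comments with the non-capturing form used in this answer. The variable names are made up for the example.

```javascript
// Outer group capturing: the quote is group 2 and the content is group 3.
const capturing = /\s*(('|")([^]*?)\2|(\S+))\s*/;
// Outer group non-capturing: the quote is group 1 and the content is group 2.
const nonCapturing = /\s*(?:('|")([^]*?)\1|(\S+))\s*/;

const sample = "'an arg' rest";
const a = capturing.exec(sample);
const b = nonCapturing.exec(sample);

console.log(a[3], a.length); // an arg 5
console.log(b[2], b.length); // an arg 4
```

Both patterns match the same text; only the indexes of the capture groups, and therefore the backreference number, differ.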

The [^] may look weird, but it is a way in JavaScript regexes to say "any character", which is broader than the dot, which does not match newlines. Other regex flavors have an s modifier that gives the dot this broader meaning; JavaScript did not support it when this answer was written, although ES2018 later added it as the s (dotAll) flag.
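The difference is easy to see on a quoted string that contains a newline. In this sketch the variable names are illustrative; `[\s\S]` is shown as an equivalent spelling that also works in most other regex flavors.

```javascript
const text = '"this\nis" rest';

const withDot = /"(.*?)"/.exec(text);        // the dot cannot cross the newline
const withCaret = /"([^]*?)"/.exec(text);    // [^] matches any character
const withClass = /"([\s\S]*?)"/.exec(text); // same idea, more portable

console.log(withDot);                        // null
console.log(JSON.stringify(withCaret[1]));   // "this\nis"
console.log(JSON.stringify(withClass[1]));   // "this\nis"
```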

trincot
  • This definitely solves the original version of the question. With your comment, however, I added an additional rule - and this solution still works for that too, with the exception of quotes surrounding the final argument. I may actually ignore that part of the rule myself, as it probably adds unnecessary complexity to the parsing. – Gawdl3y Aug 20 '16 at 17:50
  • I just added `.replace(/^"(.*)"$/g, '$1'))` to my code, which removes the wrapping double quotes on the final argument (if found on both sides). I also added an extension for when you want backslash escaping. – trincot Aug 20 '16 at 17:58
  • That certainly does the job, without adding very much complexity. I did find one issue, though - the quoted strings in the beginning (non-final) arguments don't allow newlines within them - the newline causes a split, moving the rest of the contents into the next arg. – Gawdl3y Aug 20 '16 at 18:03
  • That is fixed now with `[^]` in the regex instead of `.`. – trincot Aug 20 '16 at 18:10
  • Your answer led me to http://stackoverflow.com/a/21419569/4543207, so thank you for that. One small thing worth mentioning: although the question is unclear on this, when `"this\nis"` is among the required number of arguments, the newline is not removed. – Redu Aug 20 '16 at 18:16
  • You want all newlines to be removed from the result strings? Should they be replaced with spaces? – trincot Aug 20 '16 at 18:19
  • Shouldn't the `(.*)` in the final argument quotes matching also be replaced with `([^]*)`? – Gawdl3y Aug 20 '16 at 18:19
  • Yes, @Gawdl3y, well-spotted. I have updated that now. – trincot Aug 20 '16 at 18:20
  • @Redu All whitespace should be left as-is within a quoted string. – Gawdl3y Aug 20 '16 at 18:21
  • Added another fix: `\s*` at the start of the regex to eliminate spaces at the very start of the input. – trincot Aug 20 '16 at 18:26
  • @Redu's answer raised a thought with me: Single quotes should probably be supported as well. Any first instance of a quote in the expression should be replaced with `('|")`, and then the second instance would be a backreference to it. `match[2] || match[3]` would also need to be replaced with `match[3] || match[4]`, I believe, since the quote itself is captured. I'm not fantastic at regular expressions, but I think this would do the job. – Gawdl3y Aug 20 '16 at 18:30
  • It would need to be more elaborate than that, since a `'` opening quote has to be closed by a `'` as well, and not a `"`. Same in the other sense. I can update my answer if you wish this. – trincot Aug 20 '16 at 18:33
  • Wouldn't using a backreference accomplish that, though? – Gawdl3y Aug 20 '16 at 18:34
  • Yes, that is possible. – trincot Aug 20 '16 at 18:39
  • `\s*(('|")([^]*?)\2|(\S+))\s*` is the expression I came up with. – Gawdl3y Aug 20 '16 at 18:40
  • Perfect, and then indeed the match indexes shift: `match[3] || match[4]` – trincot Aug 20 '16 at 18:42
  • In fact, now that I look again at the regex, the outermost parentheses are not needed, so the index number and backreference number could be decreased again, if you remove those. – trincot Aug 20 '16 at 18:45
  • I take that back: the outer parentheses are needed, but they could be made non-capturing. I have updated my answer with all these last mentioned changes. – trincot Aug 20 '16 at 18:48
  • Thanks for your answer, and all your changes! I've marked this one as accepted, simply due to having cleaner code. Both answers also perform (in terms of speed) identically. – Gawdl3y Aug 20 '16 at 18:55