3

Basically, I need to split a string like

"one quoted argument" those are separate arguments "but not \"this one\""

to get the following list of arguments:

  • "one quoted argument"
  • those
  • are
  • separate
  • arguments
  • "but not \"this one\""

This regex "(\"|[^"])*"|[^ ]+ nearly does the job, but the issue is that a regular expression always (at least in Java) tries to match the longest string possible.

As a consequence, when I apply the regex to a string that starts and ends with a quoted argument, it matches the whole string instead of producing one match per argument.

Is there a way to tweak this regex or the matcher or the pattern or whatever to handle that?

Note: don't tell me I could use GetOpt or CommandLine.parse or anything else similar.
My concern is pure Java regex (if possible, but I doubt it...).

poussma
  • I don't think it's possible with regular expressions, but I may be wrong. – arshajii Nov 21 '12 at 14:33
  • What about using `*?` so that the regular expression is not greedy? `"(\\"|[^"])*?"|[^ ]+` matches what you need. – Alex Nov 21 '12 at 14:34
  • @Alex If you make that an answer I'll upvote it – durron597 Nov 21 '12 at 14:37
  • How should it handle `"one quoted argument" those are separate arguments "but not \\"this one\\""`? – Hans Then Nov 21 '12 at 14:37
  • @Alex, thanks, non-greedy quantifiers are the solution! – poussma Nov 21 '12 at 14:39
  • A. R. S. is correct, you can't use Java (Perl 5 Compatible) regular expressions to parse the full set of possible command lines. The reason is that the escaping you're referring to can nest recursively. You will either need a parser, or a different regular expression engine (see the regex system in newer versions of Perl, which can do this). – Kyle Burton Nov 21 '12 at 14:41
  • @ZNK-M I added the comment as an answer – Alex Nov 21 '12 at 14:42

4 Answers

4

regular expression always (at least in java) tries to match the longest string possible.

Um... no.

That is controlled by whether you use greedy or non-greedy quantifiers. Using a non-greedy one (by adding a question mark after the quantifier) should do it. It's called lazy quantification.

The default is greedy, but that certainly doesn't mean it is always that way.
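To make the difference concrete, here is a minimal sketch comparing a greedy and a lazy quantifier on the same input. The class name `GreedyVsLazy` and the helper `findAll` are made up for illustration, and the patterns are deliberately simpler than the one in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class GreedyVsLazy {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "\"a\" b \"c\"";   // the text: "a" b "c"
        // Greedy: .* runs on to the last quote, so the whole input is one match.
        System.out.println(findAll("\".*\"", input));   // ["a" b "c"]
        // Lazy: .*? stops at the first closing quote, so there are two matches.
        System.out.println(findAll("\".*?\"", input));  // ["a", "c"]
    }
}
```

The only change between the two patterns is the `?` after `*`; everything else about the matching stays the same.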

eis
4

You may use the non-greedy quantifier *? to make it work:

"(\\"|[^"])*?"|[^ ]+

See this link for an example in action: http://gskinner.com/RegExr/?32srs

Alex
2
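import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Splits a command string into arguments, keeping each double-quoted argument (quotes included) in one piece. */
public static String[] parseCommand( String cmd )
{
    // Nothing to split for a null or empty command.
    if( cmd == null || cmd.length() == 0 )
    {
        return new String[0];
    }

    cmd = cmd.trim();
    // Either a double-quoted run (matched lazily) or a run of non-space characters.
    // Caveat: the literal \\\" below compiles to the regex \", i.e. a plain quote,
    // so a backslash-escaped quote still ends the quoted argument.
    String regExp = "\"(\\\"|[^\"])*?\"|[^ ]+";
    Pattern pattern = Pattern.compile( regExp );
    Matcher matcher = pattern.matcher( cmd );
    List< String > matches = new ArrayList< String >();
    // Each successful find() yields the next argument.
    while( matcher.find() ) {
        matches.add( matcher.group() );
    }
    return matches.toArray( new String[0] );
}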
Igor
  • Maybe throw an NPE on null? Also, you should cache the pattern as per [Markus's answer](http://stackoverflow.com/a/22472588/281545), and `cmd.length() == 0` will be taken care of by the regex – Mr_and_Mrs_D Jul 07 '14 at 13:53
  • @Igor this is a great start at the problem, but doesn't work for the last case. – vallentin Apr 04 '16 at 12:21
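On the last case mentioned in the comment above: in the answer's Java literal, `\\\"` compiles to the regex `\"`, which matches a bare quote, so the lazy quantifier stops at the first escaped quote. One possible fix is an extra level of escaping so that the regex becomes `"(\\"|[^"])*?"|[^ ]+`, as in Alex's answer. The sketch below assumes that reading; the class name `ArgSplitter` and the method `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class ArgSplitter {
    // Regex seen by the engine: "(\\"|[^"])*?"|[^ ]+
    // In the Java literal, \\\\ is the regex \\ (one literal backslash) and \" is a quote.
    private static final Pattern ARG =
        Pattern.compile("\"(\\\\\"|[^\"])*?\"|[^ ]+");

    // Hypothetical helper: returns each match as one argument, quotes left in place.
    static List<String> split(String cmd) {
        List<String> args = new ArrayList<>();
        Matcher m = ARG.matcher(cmd);
        while (m.find()) {
            args.add(m.group());
        }
        return args;
    }

    public static void main(String[] unused) {
        // The string from the question, escaped for Java source.
        String cmd = "\"one quoted argument\" those are separate arguments "
                   + "\"but not \\\"this one\\\"\"";
        for (String arg : split(cmd)) {
            System.out.println(arg);
        }
    }
}
```

With this pattern the final quoted argument, including its escaped inner quotes, comes back as a single match. The pattern is also compiled once and reused, as suggested in the first comment.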
2

I came up with this one (thanks Alex for giving me the good starting point :))

/**
 * Pattern that is capable of dealing with complex command line quoting and
 * escaping. This can recognize correctly:
 * <ul>
 * <li>"double quoted strings"
 * <li>'single quoted strings'
 * <li>"escaped \"quotes within\" quoted string"
 * <li>C:\paths\like\this or "C:\path like\this"
 * <li>--arguments=like_this or "--args=like this" or '--args=like this' or
 * --args="like this" or --args='like this'
 * <li>quoted\ whitespaces\\t (spaces & tabs)
 * <li>and probably more :)
 * </ul>
 */
private static final Pattern cliCracker = Pattern
    .compile(
       "[^\\s]*\"(\\\\+\"|[^\"])*?\"|[^\\s]*'(\\\\+'|[^'])*?'|(\\\\\\s|[^\\s])+",
       Pattern.MULTILINE);
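A possible way to apply this pattern is to loop over `Matcher.find()` and collect each match as one argument. In the sketch below the pattern string is copied unchanged from the code above, while the class name `CliCrackerDemo` and the method `crack` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
class CliCrackerDemo {
    private static final Pattern cliCracker = Pattern.compile(
        "[^\\s]*\"(\\\\+\"|[^\"])*?\"|[^\\s]*'(\\\\+'|[^'])*?'|(\\\\\\s|[^\\s])+",
        Pattern.MULTILINE);

    // Hypothetical helper: returns each match as one argument.
    // Quotes and backslashes are left in place; nothing is unescaped here.
    static List<String> crack(String commandLine) {
        List<String> args = new ArrayList<>();
        Matcher m = cliCracker.matcher(commandLine);
        while (m.find()) {
            args.add(m.group());
        }
        return args;
    }

    public static void main(String[] unused) {
        // The text: --args="like this" 'single quoted' escaped\ space
        String line = "--args=\"like this\" 'single quoted' escaped\\ space";
        for (String arg : crack(line)) {
            System.out.println(arg);
        }
    }
}
```

Note that the matches still contain the surrounding quotes and the escaping backslashes, so a caller that wants the raw values needs a separate unquoting step.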
Mr_and_Mrs_D
Markus Duft
  • Will be using it and test. FWIW here is the implementation of [translateCommandline](https://commons.apache.org/proper/commons-exec/apidocs/src-html/org/apache/commons/exec/CommandLine.html) - line 337 - see also http://stackoverflow.com/questions/3259143/split-a-string-containing-command-line-parameters-into-a-string-in-java – Mr_and_Mrs_D Jul 07 '14 at 13:50