20

I have command line arguments in a string and I need to split it to feed to argparse.ArgumentParser.parse_args.

I see that the documentation uses str.split() liberally. However, this does not work in more complex cases, such as

--foo "spaces in brakets"  --bar escaped\ spaces

Is there functionality in Python to do this?

(A similar question for Java was asked here).

P-Gn

3 Answers

28

This is what shlex.split was created for.
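As a minimal sketch of how this fits together with argparse, using the string from the question (the `parse_arg_string` helper and the `--foo`/`--bar` options are hypothetical names chosen for illustration):

```python
import argparse
import shlex


def parse_arg_string(arg_string):
    """Split a shell-style argument string and hand the pieces to argparse."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--foo")
    parser.add_argument("--bar")
    # shlex.split applies POSIX shell quoting rules: double quotes group
    # words, and a backslash escapes the character that follows it.
    return parser.parse_args(shlex.split(arg_string))


args = parse_arg_string(r'--foo "spaces in brakets"  --bar escaped\ spaces')
print(args.foo)  # spaces in brakets
print(args.bar)  # escaped spaces
```

A plain `str.split()` on the same string would produce seven tokens, and argparse would reject the stray ones as unrecognized arguments.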

ShadowRanger
  • Nice! And it's available since Python 2.3. – randomir Jul 06 '17 at 10:40
  • Does `shlex.split` have an issue with escaped quote marks? e.g `--foo "bar\"baz"` – P-Gn Jul 06 '17 at 12:13
  • @user1735003: Yes, though it would usually be the shell handling this for you (`shlex` follows mostly the same rules as `sh` shell rules). But if you have a constructed command line like that, it's fine with it, that's the whole point of `shlex`: `shlex.split(r'--foo "bar\"baz"')` produces `['--foo', 'bar"baz']`. The `argparse` docs are [being lazy](https://bugs.python.org/issue20598) when they use `str.split` instead of `shlex.split` (or explicit lists); they were going for brevity, but without the mental load of requiring `shlex` knowledge. – ShadowRanger Jul 06 '17 at 12:57
  • This works great for POSIX systems, but doesn't help if you want to split arguments in a Windows format. – Eric Feb 17 '19 at 06:22
  • 2
    @Eric: Given there is no single Windows format (Windows executables receive the raw string and parse it themselves), and that the question was about parsing a string for `argparse` to work with (which has a fixed behavior regardless of OS), your comment doesn't seem particularly relevant to this case. – ShadowRanger Feb 17 '19 at 15:25
  • 1
    Even though Windows programs receive the raw string, most of them define a `main(argc, argv)`, which ends up using the parsing provided by their C runtime. `argparse` has a fixed behavior regardless of OS, but that's because it takes a list of strings as input, typically `sys.argv`. How `sys.argv` gets populated _is_ platform-dependent, and it's worth drawing attention to that. `shlex.split` matches the way `sys.argv` is populated on POSIX systems, but not how it is populated on Windows systems. – Eric Feb 17 '19 at 19:18
6

If you're parsing a Windows-style command line, shlex.split doesn't work correctly: calling subprocess functions on the result will not behave the same as passing the string directly to the shell.

In that case, the most reliable way to split a string the way Python's own command-line arguments are split is... to pass it to Python as command-line arguments:

import os  # needed for the os.name check below
import sys
import subprocess
import shlex
import json  # json is an easy way to send arbitrary ascii-safe lists of strings out of python

def shell_split(cmd):
    """
    Like `shlex.split`, but uses the Windows splitting syntax when run on Windows.

    On Windows, this is the inverse of subprocess.list2cmdline
    """
    if os.name == 'posix':
        return shlex.split(cmd)
    else:
        # TODO: write a version of this that doesn't invoke a subprocess
        if not cmd:
            return []
        full_cmd = '{} {}'.format(
            subprocess.list2cmdline([
                sys.executable, '-c',
                'import sys, json; print(json.dumps(sys.argv[1:]))'
            ]), cmd
        )
        ret = subprocess.check_output(full_cmd).decode()
        return json.loads(ret)

One example of how these differ:

# Windows does not treat all backslashes as escapes
>>> shell_split(r'C:\Users\me\some_file.txt "file with spaces"')
['C:\\Users\\me\\some_file.txt', 'file with spaces']

# posix does
>>> shlex.split(r'C:\Users\me\some_file.txt "file with spaces"')
['C:Usersmesome_file.txt', 'file with spaces']

# non-posix does not mean Windows - this produces extra quotes
>>> shlex.split(r'C:\Users\me\some_file.txt "file with spaces"', posix=False)
['C:\\Users\\me\\some_file.txt', '"file with spaces"']  
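The same difference shows up in the quoting direction. The following is a small sketch, not part of the original answer, that builds a command string from a list under each convention. It assumes Python 3.8+ for `shlex.join`, and it relies on `subprocess.list2cmdline`, which is importable on every platform but is not part of the documented public API.

```python
import shlex
import subprocess

args = [r'C:\Users\me\some_file.txt', 'file with spaces']

# POSIX-style quoting: shlex.join is the inverse of shlex.split,
# so splitting the joined string gives back the original list.
posix_cmd = shlex.join(args)
round_trip = shlex.split(posix_cmd)
print(round_trip == args)  # True

# Windows-style quoting (Microsoft C runtime rules): backslashes are left
# alone unless they precede a double quote, and arguments containing
# spaces are wrapped in double quotes.
windows_cmd = subprocess.list2cmdline(args)
print(windows_cmd)  # C:\Users\me\some_file.txt "file with spaces"
```

Feeding `windows_cmd` to `shlex.split` in its default POSIX mode would strip the backslashes, which is the mismatch shown in the examples above.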
Eric
2

You could use the split_arg_string helper function from the click package:

import re

def split_arg_string(string):
    """Given an argument string this attempts to split it into small parts."""
    rv = []
    for match in re.finditer(r"('([^'\\]*(?:\\.[^'\\]*)*)'"
                             r'|"([^"\\]*(?:\\.[^"\\]*)*)"'
                             r'|\S+)\s*', string, re.S):
        arg = match.group().strip()
        if arg[:1] == arg[-1:] and arg[:1] in '"\'':
            arg = arg[1:-1].encode('ascii', 'backslashreplace') \
                .decode('unicode-escape')
        try:
            arg = type(string)(arg)
        except UnicodeError:
            pass
        rv.append(arg)
    return rv

For example:

>>> print(split_arg_string('"this is a test" 1 2 "1 \\" 2"'))
['this is a test', '1', '2', '1 " 2']

The click package is starting to dominate command-line argument parsing, but I don't think it supports parsing arguments from a string (only from argv). The helper function above is used only for bash completion.

Edit: I can only recommend using shlex.split(), as suggested in the answer by @ShadowRanger. The only reason I'm not deleting this answer is that it splits a little faster than the full-blown pure-Python tokenizer used in shlex (around 3.5x faster for the example above, 5.9us vs 20.5us). However, this shouldn't be a reason to prefer it over shlex.
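A rough way to check both the speed and the behavior yourself is sketched below. It is not a rigorous benchmark: timings vary by machine and Python version, so no expected numbers are shown. The helper is copied from above, minus the Python 2 `type(string)` conversion, and the `sample` and `quoted` names are just illustrative.

```python
import re
import shlex
import timeit


def split_arg_string(string):
    """Regex-based splitter, copied from the helper shown above."""
    rv = []
    for match in re.finditer(r"('([^'\\]*(?:\\.[^'\\]*)*)'"
                             r'|"([^"\\]*(?:\\.[^"\\]*)*)"'
                             r'|\S+)\s*', string, re.S):
        arg = match.group().strip()
        if arg[:1] == arg[-1:] and arg[:1] in '"\'':
            arg = arg[1:-1].encode('ascii', 'backslashreplace') \
                .decode('unicode-escape')
        rv.append(arg)
    return rv


# Behavior: the regex does not treat a backslash outside quotes as an
# escape, so the escaped space from the question still splits the word.
sample = r'--bar escaped\ spaces'
print(shlex.split(sample))       # ['--bar', 'escaped spaces']
print(split_arg_string(sample))  # ['--bar', 'escaped\\', 'spaces']

# Speed: time both splitters on the quoted example from above.
quoted = '"this is a test" 1 2 "1 \\" 2"'
runs = 10_000
for name, fn in (("split_arg_string", split_arg_string),
                 ("shlex.split", shlex.split)):
    seconds = timeit.timeit(lambda: fn(quoted), number=runs)
    print(f"{name}: {seconds / runs * 1e6:.1f} us per call")
```

The behavioral difference on unquoted backslashes is another reason to default to shlex.split.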

randomir