I would like to parse a string like this:

-o 1  --long "Some long string"  

into this:

["-o", "1", "--long", 'Some long string']

or similar.

This is different from getopt or optparse, which both start from input that has already been split, such as sys.argv (like the output shown above). Is there a standard way to do this? Basically, this is "splitting" while keeping quoted strings together.

My best function so far:

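import csv
import io

def split_quote(string, quotechar='"'):
    '''
    Split a string on spaces, keeping quoted substrings together.

    >>> split_quote('--blah "Some argument" here')
    ['--blah', 'Some argument', 'here']

    >>> split_quote("--blah 'Some argument' here", quotechar="'")
    ['--blah', 'Some argument', 'here']
    '''
    # io.StringIO is the documented location of StringIO; csv.StringIO only
    # works because the csv module happens to import it.
    s = io.StringIO(string)
    reader = csv.reader(s, delimiter=" ", quotechar=quotechar)
    return list(reader)[0]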
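
One caveat with the csv-based idea, sketched here assuming Python 3: every single space counts as a delimiter, so runs of spaces and trailing spaces (as in the sample input) produce empty fields. The helper name `split_quote_csv` is made up for this illustration.

```python
import csv
import io

def split_quote_csv(string, quotechar='"'):
    # Hypothetical helper mirroring the csv-based approach above.
    reader = csv.reader(io.StringIO(string), delimiter=" ", quotechar=quotechar)
    return next(reader, [])

# Two spaces after "1" and two trailing spaces, as in the sample input.
fields = split_quote_csv('-o 1  --long "Some long string"  ')
print(fields)

# Dropping empty fields tidies the result, but it would also drop a
# deliberately empty quoted argument such as "".
non_empty = [f for f in fields if f]
print(non_empty)
```

Filtering is therefore only a partial fix, and it loses information whenever an empty argument is meaningful.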
Georgy
Gregg Lind
  • My own true forgetfulness revealed: http://stackoverflow.com/questions/92533, has me using shlex.split. Clearly I just forgot about it. – Gregg Lind May 25 '09 at 23:23
  • If what you actually need is "to process options" and not just "to parse strings on commandline", you could consider http://docs.python.org/2/library/argparse.html – Jan Spurny Jul 31 '13 at 12:08

2 Answers

I believe you want the shlex module.

>>> import shlex
>>> shlex.split('-o 1 --long "Some long string"')
['-o', '1', '--long', 'Some long string']
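
A slightly fuller sketch of what shlex.split does with other shell-style quoting, assuming Python 3 and the default POSIX mode. The sample strings are made up for illustration.

```python
import shlex

# Double quotes, single quotes and a backslash-escaped space all keep
# their contents together as one argument.
line = '-o 1 --long "Some long string" --name \'single quoted\' escaped\\ space'
tokens = shlex.split(line)
print(tokens)

# With posix=False the quote characters are kept in the result.
kept = shlex.split('--long "Some long string"', posix=False)
print(kept)

# An unterminated quote is reported as a ValueError.
try:
    shlex.split('--long "never closed')
    unbalanced_error = None
except ValueError as exc:
    unbalanced_error = exc
print(unbalanced_error)
```

If the input comes from users, catching ValueError as above is probably worth doing, since an unbalanced quote is an easy mistake to make.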
Jacob Gabrielson
  • Thank you! I knew there was something like this! – Gregg Lind May 22 '09 at 18:41
  • That's great, except that it doesn't seem to support Unicode strings. The doc says that Python 2.7.3 supports Unicode strings, but when I try it, `shlex.split(u'abc 123 →')` gives me a `UnicodeEncodeError`. – Craig McQueen May 13 '13 at 22:49
  • I guess `list(a.decode('utf-8') for a in shlex.split(u'abc 123 →'.encode('utf-8')))` will work. – Craig McQueen May 13 '13 at 23:01
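
The Unicode comments above concern Python 2. Assuming Python 3, where str is already Unicode, the same input should split without any encode/decode round trip:

```python
import shlex

# Assumes Python 3: str is Unicode, so no encoding step is needed.
tokens = shlex.split('abc 123 →')
print(tokens)
```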

Before I was aware of shlex.split, I made the following:

import sys

_WORD_DIVIDERS = set((' ', '\t', '\r', '\n'))

_QUOTE_CHARS_DICT = {
    '\\':   '\\',
    ' ':    ' ',
    '"':    '"',
    'r':    '\r',
    'n':    '\n',
    't':    '\t',
}

def _raise_type_error():
    raise TypeError("Bytes must be decoded to Unicode first")

def parse_to_argv_gen(instring):
    is_in_quotes = False
    instring_iter = iter(instring)
    join_string = instring[0:0]

    c_list = []
    c = ' '
    while True:
        # Skip whitespace
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if c not in _WORD_DIVIDERS:
                    break
                c = next(instring_iter)
        except StopIteration:
            break
        # Read word
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if not is_in_quotes and c in _WORD_DIVIDERS:
                    break
                if c == '"':
                    is_in_quotes = not is_in_quotes
                    c = None
                elif c == '\\':
                    c = next(instring_iter)
                    c = _QUOTE_CHARS_DICT.get(c)
                if c is not None:
                    c_list.append(c)
                c = next(instring_iter)
            yield join_string.join(c_list)
            c_list = []
        except StopIteration:
            yield join_string.join(c_list)
            break

def parse_to_argv(instring):
    return list(parse_to_argv_gen(instring))

This works with Python 2.x and 3.x. On Python 2.x, it works directly with byte strings and Unicode strings. On Python 3.x, it only accepts [Unicode] strings, not bytes objects.

This doesn't behave exactly like shell argv splitting: it also allows CR, LF and TAB characters to be written as the escapes \r, \n and \t, and converts them to real CR, LF and TAB characters (shlex.split doesn't do that). So writing my own function was useful for my needs. I guess shlex.split is better if you just want plain shell-style argv splitting. I'm sharing this code in case it's useful as a baseline for doing something slightly different.
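
For comparison, a short check of how shlex.split treats a \n sequence (a sketch, assuming Python 3 and the default POSIX mode). Neither case produces a real newline character.

```python
import shlex

# Raw strings so that Python itself leaves the backslashes alone.

# Outside quotes, the backslash just escapes the following character.
unquoted = shlex.split(r'one\ntwo')
print(unquoted)

# Inside double quotes, the backslash is kept literally before the "n".
quoted = shlex.split(r'"one\ntwo"')
print(quoted)
```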

Craig McQueen