340

I have a string which is like this:

this is "a test"

I'm trying to write something in Python to split it up by space while ignoring spaces within quotes. The result I'm looking for is:

['this', 'is', 'a test']

PS. I know you are going to ask, "What happens if there are quotes within the quotes?" Well, in my application, that will never happen.

buhtz
Adam Pierce

16 Answers

517

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

If you want to preserve the quotation marks, then you can pass the posix=False kwarg.

>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']
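
As a quick check of the two modes above, here is a minimal sketch that also covers single-quoted input and an unbalanced quote. The input strings are made up for illustration; only `shlex.split` and its `posix` argument come from the answer.

```python
import shlex

# POSIX mode (the default): quotes group words and are then removed.
posix_tokens = shlex.split('this is "a test"')

# Single quotes group words in the same way.
single_tokens = shlex.split("this is 'a test'")

# posix=False keeps the surrounding quote characters in the token.
raw_tokens = shlex.split('this is "a test"', posix=False)

# An unbalanced quote raises ValueError instead of guessing where it ends.
try:
    shlex.split('this is "a test')
    unbalanced_error = None
except ValueError as exc:
    unbalanced_error = str(exc)

print(posix_tokens, single_tokens, raw_tokens, unbalanced_error)
```

If the input may contain stray quotes, wrapping the call in `try`/`except ValueError` as above is probably the simplest way to cope.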
Jerub
  • This is the simplest method that directly answers OP's question. If you need support for nested strings using escaped characters and/or multiple quote types, see the answer from @user261478 – drootang Apr 28 '23 at 13:33
74

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
Pavel Štěrba
Allen
48

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.
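
For reference, here is a sketch of the same pattern written as a raw string, which avoids the backslash escapes. The function name `split_keep_quotes` and the second sample input are hypothetical; the pattern itself is the one from this answer. Note that the quote characters stay in the result, and that text glued to a quoted part is split off from it.

```python
import re

# Same alternatives as the pattern in the answer: a single space,
# a double-quoted run, or a single-quoted run.
PATTERN = re.compile(r"""( |".*?"|'.*?')""")

def split_keep_quotes(text):
    # re.split returns captured separators too, so the quoted chunks come
    # back as list items; the filter drops single spaces and empty strings.
    return [p for p in PATTERN.split(text) if p.strip()]

simple = split_keep_quotes('this is "a test"')
glued = split_keep_quotes('PARAMS val1="Thing" val2="Thing2"')
print(simple)
print(glued)
```

The second call yields five pieces rather than three, because `val1=` and `"Thing"` end up as separate items.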

  • 1
    I was thinking much the same, but would suggest instead [t.strip('"') for t in re.findall(r'[^\s"]+|"[^"]*"', 'this is "a test"')] – Darius Bacon Feb 08 '09 at 03:09
  • What does that split do when there are apostrophes inside the double quotes: He said, "Don't do that!" I think it will treat <"Don'> as one unit, won't it? – Jonathan Leffler Feb 08 '09 at 03:21
  • Jonathan: in this case, no, I made two mistakes that cancel each other out in that case: the greedy .* will go to the final ". :-) I should have said "( |\\\".*?\\\"|'.*?')". Nice catch. –  Feb 08 '09 at 03:39
  • 2
    +1 I'm using this because it was a heck of a lot faster than shlex. – hanleyp Nov 16 '09 at 19:44
  • +1 from me, the 2nd regex (comments) works for my needs whereas the first doesn't. As such I've edited in the second regex but left the first easily visible. –  Mar 16 '10 at 21:47
  • P.S. this is excellent, I don't need the features of shlex, just a split like argv. I'd give it +2 if I could. –  Mar 16 '10 at 21:47
  • 1
    that code almost looks like perl, haven't you heard of r"raw strings"? – SpliFF Mar 22 '10 at 06:41
  • Consider this data: string = r'simple "quot ed" "ignore the escape with quotes\\" "howboutthemapostrophe\'s?" "\"withescapedquotes\"" "\"with unbalanced escaped quotes"' The Jonathan / Kate / Ninefingers update botches the withescapedquotes term, splitting it into three (degenerate-quote-alone, withescapedquotes, another-degenerate). shlex.split(string) is fine. Can that be done via re? – jackr Jun 25 '10 at 18:55
  • 3
    Why the triple backslash ? won't a simple backslash do the same ? – Doppelganger Aug 12 '11 at 21:20
  • This one handles unbalanced quotes and unicode, shlex does not :( – lambacck Nov 23 '11 at 18:29
  • +1 I like this answer because it actually preserves the quotations, unlike Shlex. Shlex split should only do splitting, it shouldn't remove the quotations on me. Though perhaps it's configurable. – leetNightshade Jul 22 '13 at 15:21
  • 1
    Actually, one thing I don't like about this is that anything before/after quotes are not split properly. If I have a string like this 'PARAMS val1="Thing" val2="Thing2"'. I expect the string to split into three pieces, but it splits into 5. It's been a while since I've done regex, so I don't feel like trying to solve it using your solution right now. – leetNightshade Jul 23 '13 at 00:00
  • 2
    You should use raw strings when using regular expressions. – asmeurer Dec 19 '13 at 02:29
  • This one handles both types of quotes and strips only the parsed one: `[''.join(t) for t in re.findall(r"""([^\s"']+)|"([^"]*)"|'([^']*)'""", test)]` – MortenB May 26 '16 at 14:50
  • To use any delimiter (not white space only) and to fix the problem of the wrong split before/after quotes have a look here https://stackoverflow.com/a/56791724/1201614 – luca Jun 29 '19 at 10:33
  • Can somebody expand on the "explanation" given? I'm fairly comfortable with regex, but I did not understand the "explanation". Can anyone provide a bit more detail here? It looks like this would be my preferred solution, except that it is unclear to me. – SherylHohman Apr 01 '20 at 19:17
  • I prefer this one rather than using `shlex` because for me, I don't need any unwanted escape things, just the splitting thing is what I needed. – 0xAA55 Nov 25 '21 at 06:24
34

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
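
For a single string, the same idea can be wrapped in a small helper. This is only a sketch: the name `csv_split` and the sample inputs are made up, while `delimiter` and `quotechar` are standard `csv.reader` options. It also shows that repeated spaces produce empty fields, which may or may not be what you want.

```python
import csv

def csv_split(line, quotechar='"'):
    # csv.reader expects an iterable of lines, so wrap the string in a list
    # and take the first (only) row.
    return next(csv.reader([line], delimiter=" ", quotechar=quotechar))

double_quoted = csv_split('this is "a string"')
single_quoted = csv_split("this is 'a string'", quotechar="'")
repeated_spaces = csv_split('a  b')
print(double_quoted, single_quoted, repeated_spaces)
```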
Boris Verkhovskiy
Ryan Ginstrom
  • 2
    useful, when shlex strips some needed characters – scraplesh Mar 29 '13 at 18:08
  • 1
    CSV's [use two double quotes in a row](https://tools.ietf.org/html/rfc4180#section-2) (as in side-by-side, `""`) to represent one double quote `"`, so will turn two double quotes into a single quote `'this is "a string""'` and `'this is "a string"""'` will both map to `['this', 'is', 'a string"']` – Boris Verkhovskiy Oct 31 '19 at 23:45
  • If the delimiter is other than space, shlex is adding the delimiter to individual strings. – Vinod Oct 03 '22 at 02:04
  • useful, I had the case of the comma as the thousand separator like `['UK', 'London', '1,234,567.89']`; then using `for row in csv.reader(lines, delimiter=",")` interprets the records correctly – Domenico Spidy Tamburro Aug 16 '23 at 14:23
17

I used shlex.split to process 70,000,000 lines of Squid logs, and it was very slow, so I switched to re.

Try this if you have performance problems with shlex.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
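
A sketch of the same function with the pattern compiled once, which may help a little when it is called millions of times. The name `TOKEN_RE` and the sample inputs are hypothetical. The asserted behaviour is worth knowing before relying on it: quotes are kept, an empty `""` is dropped, and a quote that starts in the middle of a word does not protect the spaces after it.

```python
import re

# Either a word that does not start with a quote, or a double-quoted run.
TOKEN_RE = re.compile(r'[^"\s]\S*|".+?"')

def line_split(line):
    return TOKEN_RE.findall(line)

basic = line_split('this is "a test"')
empty_quotes = line_split('a "" b')
mid_word_quote = line_split('key="a b"')
print(basic, empty_quotes, mid_word_quote)
```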
Daniel Dai
12

It seems that re is faster. Here is my solution using a non-greedy operator that preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

It leaves constructs like aaa"bla blub"bbb together as these tokens are not separated by spaces. If the string contains escaped characters, you can match like this:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.
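
A short sketch of the pattern in use, including the comma-delimited variant suggested in the comments. The constant names and sample strings are made up for illustration.

```python
import re

# A token is one or more of: a double-quoted run, or a non-space character.
WHITESPACE_TOKENS = re.compile(r'(?:".*?"|\S)+')

# Same idea with a comma as the delimiter instead of whitespace.
COMMA_TOKENS = re.compile(r'(?:".*?"|[^,])+')

basic = WHITESPACE_TOKENS.findall('this is "a test"')
glued = WHITESPACE_TOKENS.findall('aaa"bla blub"bbb ccc')
by_comma = COMMA_TOKENS.findall('a,"b,c",d')
print(basic, glued, by_comma)
```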

hochl
  • 1
    Another important advantage of this solution is its versatility with respect to the delimiting character (e.g. `,` via `'(?:".*?"|[^,])+'`). The same applies to the quoting (enclosing) character(s). – a_guest Jun 05 '19 at 13:24
11

The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and gives slightly unexpected results in some corner cases.

I have the following use case, where I need a split function that splits input strings such that either single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output:

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]
 'abc"def ghi"'      | ['abc"def ghi"']
 'abc"def ghi""jkl"' | ['abc"def ghi""jkl"']
 'a"b c"d"e"f"g h"'  | ['a"b c"d"e"f"g h"']
 'c="ls /" type key' | ['c="ls /"', 'type', 'key']
 "abc'def ghi'"      | ["abc'def ghi'"]
 "c='ls /' type key" | ["c='ls /'", 'type', 'key']

I ended up with the following function, which produces the expected output for all of the input strings above:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

It ain't pretty, but it works. The following test application checks the results of other approaches (shlex and csv for now) and the custom split implementation:

#!/usr/bin/env python3

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    # Print OK/FAIL for one input; an exception counts as a failure.
    try:
        if fn(s) == expected:
            print('[ OK ] %s -> %s' % (s, fn(s)))
        else:
            print('[FAIL] %s -> %s' % (s, fn(s)))
    except Exception as e:
        print('[FAIL] %s -> exception: %s' % (s, e))

def test_case_no_output(fn, s, expected):
    # Silent variant used for benchmarking.
    try:
        fn(s)
    except Exception:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])
    test_case_fn(fn, 'abc"def ghi"', ['abc"def ghi"'])
    test_case_fn(fn, 'abc"def ghi""jkl"', ['abc"def ghi""jkl"'])
    test_case_fn(fn, 'a"b c"d"e"f"g h"', ['a"b c"d"e"f"g h"'])
    test_case_fn(fn, 'c="ls /" type key', ['c="ls /"', 'type', 'key'])
    test_case_fn(fn, "abc'def ghi'", ["abc'def ghi'"])
    test_case_fn(fn, "c='ls /' type key", ["c='ls /'", 'type', 'key'])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

if __name__ == '__main__':
    print('shlex\n')
    test_split(shlex.split)
    print()

    print('csv\n')
    test_split(csv_split)
    print()

    print('re\n')
    test_split(re_split)
    print()

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print('%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations)))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']
[FAIL] abc"def ghi" -> ['abcdef ghi']
[FAIL] abc"def ghi""jkl" -> ['abcdef ghijkl']
[FAIL] a"b c"d"e"f"g h" -> ['ab cdefg h']
[FAIL] c="ls /" type key -> ['c=ls /', 'type', 'key']
[FAIL] abc'def ghi' -> ['abcdef ghi']
[FAIL] c='ls /' type key -> ['c=ls /', 'type', 'key']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[FAIL] abc"def ghi" -> ['abc"def', 'ghi"']
[FAIL] abc"def ghi""jkl" -> ['abc"def', 'ghi""jkl"']
[FAIL] a"b c"d"e"f"g h" -> ['a"b', 'c"d"e"f"g', 'h"']
[FAIL] c="ls /" type key -> ['c="ls', '/"', 'type', 'key']
[FAIL] abc'def ghi' -> ["abc'def", "ghi'"]
[FAIL] c='ls /' type key -> ["c='ls", "/'", 'type', 'key']

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[ OK ] abc"def ghi" -> ['abc"def ghi"']
[ OK ] abc"def ghi""jkl" -> ['abc"def ghi""jkl"']
[ OK ] a"b c"d"e"f"g h" -> ['a"b c"d"e"f"g h"']
[ OK ] c="ls /" type key -> ['c="ls /"', 'type', 'key']
[ OK ] abc'def ghi' -> ["abc'def ghi'"]
[ OK ] c='ls /' type key -> ["c='ls /'", 'type', 'key']

shlex: 0.335ms per iteration
csv: 0.036ms per iteration
re: 0.068ms per iteration

So performance is much better than shlex, and can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.
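
A sketch of what the precompiled variant might look like. The pattern is the one from `quoted_split` above, split across adjacent raw-string literals for readability; the name `TOKEN_RE` is an assumption, and no timing is claimed here. The checks at the end are a few rows from the test table.

```python
import re

# Same pattern as in quoted_split, compiled once at import time.
TOKEN_RE = re.compile(
    r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+'          # word containing "..." parts
    r'|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+'    # word containing '...' parts
    r'|[^\s]+'                                     # any other run of non-spaces
)

def strip_quotes(token):
    # Remove the outer quotes only when the whole token is wrapped in them.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split(s):
    return [
        strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
        for p in TOKEN_RE.findall(s)
    ]

print(quoted_split('"abc def" ghi'))
print(quoted_split('c="ls /" type key'))
```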

Ton van den Heuvel
  • Not sure what you're talking about: ``` >>> shlex.split('this is "a test"') ['this', 'is', 'a test'] >>> shlex.split('this is \\"a test\\"') ['this', 'is', '"a', 'test"'] >>> shlex.split('this is "a \\"test\\""') ['this', 'is', 'a "test"'] ``` – morsik May 15 '19 at 16:27
  • @morsik, what is your point? Maybe your use case does not match mine? When you look at the test cases you'll see all cases where `shlex` does not behave as expected for my use cases. – Ton van den Heuvel May 15 '19 at 16:35
  • I was hopeful, but unfortunately your approach also fails in a case I need, where `shlex` and `csv` fail too. String to parse: `command="echo hi" type key`. – Jean-Bernard Jansen Apr 19 '21 at 13:07
  • @Jean-BernardJansen, there were indeed some issues when it comes to handling quotes; I've updated the regex and it should now handle your case correctly. – Ton van den Heuvel Jul 05 '22 at 08:23
8

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
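
To make the three steps concrete, here is a sketch that keeps each intermediate value. The variable names are made up; the logic is the same as in splitter. Note that the quote characters remain in the result.

```python
import re

s = 'this is "a test"'

# Step 1: inside each quoted part, hide the spaces behind \x00.
protected = re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s)

# Step 2: a plain whitespace split now leaves the quoted part in one piece,
# because \x00 is not treated as whitespace by str.split().
pieces = protected.split()

# Step 3: turn the placeholders back into spaces.
restored = [p.replace("\x00", " ") for p in pieces]

print(repr(protected))
print(restored)
```

This assumes the input never contains a literal \x00 character of its own.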
elifiner
  • You should have used re.Scanner instead. It's more reliable (and I have in fact implemented a shlex-like using re.Scanner). – Devin Jeanpierre Mar 24 '09 at 16:37
  • +1 Hm, this is a pretty smart idea, breaking the problem down into multiple steps so the answer isn't terribly complex. Shlex didn't do exactly what I needed, even with trying to tweak it. And the single pass regex solutions were getting really weird and complicated. – leetNightshade Jul 23 '13 at 16:31
7

Speed test of different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
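
The `%timeit` lines above only work inside IPython. Outside it, a rough equivalent with the standard `timeit` module could look like the sketch below; the dictionary layout and the repeat counts are arbitrary choices, and the numbers will differ from machine to machine.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# One zero-argument callable per approach being compared.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

number = 2000
timings = {}
for name, fn in candidates.items():
    # Best of 3 runs of `number` calls each, converted to microseconds per call.
    best = min(timeit.repeat(fn, number=number, repeat=3))
    timings[name] = best / number * 1e6
    print(f"{name}: {timings[name]:.2f} µs per call")
```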
har777
4

To preserve quotes use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
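
One thing to watch for: with repeated spaces, the function above emits empty strings (for example, 'a  b' gives ['a', '', 'b']). If that is not wanted, a variant along these lines skips them. This is only a sketch, and the name `get_args` is made up.

```python
def get_args(s):
    args = []
    cur = ''
    in_quotes = False
    for char in s.strip():
        if char == ' ' and not in_quotes:
            # End of a token; ignore the empty token between repeated spaces.
            if cur:
                args.append(cur)
            cur = ''
        else:
            if char == '"':
                in_quotes = not in_quotes
            cur += char  # quote characters are kept in the token
    if cur:
        args.append(cur)
    return args

print(get_args('this is "a test"'))
print(get_args('a   "b  c"  d'))
```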
2

Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version (Python 2 only, since it relies on str.decode('string_escape')):

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
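
On Python 3, where `str` has no `decode` method, one possible substitute is to drop the backslash in front of each escaped character with `re.sub`. This is a simplification, not an equivalent of `string_escape`: it does not interpret sequences such as `\n`. The names `SPLIT_RE` and `split_unescaped` are hypothetical.

```python
import re

# Split pattern from the answer: whitespace, or a quoted run whose opening
# and closing quotes are not preceded by a backslash.
SPLIT_RE = re.compile(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')')

def split_unescaped(text):
    tokens = []
    for piece in SPLIT_RE.split(text):
        if not piece.strip():
            continue  # skip whitespace separators and empty strings
        piece = piece.strip('"').strip("'")
        # Drop the backslash from every backslash-escaped character.
        tokens.append(re.sub(r'\\(.)', r'\1', piece))
    return tokens

sample = r'''This is " a \"test\"\'s substring"'''
result = split_unescaped(sample)
print(result)
```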
  • This is by far the best answer. Using a negative lookbehind is the best way to ensure you don't match escaped end-quote characters and don't start a new quote with an escaped start-quote character. It's easily extensible to support multiple quoting characters (e.g., ", ', {}, [], etc) – drootang Apr 28 '23 at 13:30
  • In my case I needed to preserve the quote character on each string, so i just removed the .strip() commands in the list comprehension – drootang Apr 28 '23 at 13:31
2

As an option try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter=' ')
Out[2]: ['this', 'is', 'a test']
Mikhail Zakharov
1

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]
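
On Python 3 this workaround should not be needed, since `str` is already Unicode and `shlex.split` accepts it directly. A minimal check, with a made-up sample string:

```python
import shlex

# Non-ASCII characters pass through unchanged; quotes still group words.
tokens = shlex.split('naïve "café au lait"')
print(tokens)
```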
moschlar
  • For python 2.7.5 this should be: `split = lambda a: [b.decode('utf-8') for b in _split(a)]` otherwise you get: `UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...: ordinal not in range(128)` – Peter Varo Jun 27 '13 at 00:43
0

I suggest:

test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

to capture also "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

to ignore empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']
hussic
-3

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]
pjz
-3

If you don't care about substrings, then a simple split will do:

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or string module

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: String module seems to perform better than string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use RE engine

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings you should not load the entire string into memory and instead either split the lines or use an iterative loop

Gregory
  • 11
    You seem to have missed the whole point of the question. There are quoted sections in the string that need to not be split. – rjmunro Oct 31 '08 at 23:08