I'm wondering what the best way is to split a space-separated string at the last space that is not inside [, {, ( or ". For instance, I could have:

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'

For a it should parse into ['a', 'b', 'c', 'd', 'e', 'f'], ["something else here"]
and b should parse into ['another', 'parse', 'option'], ['{(["gets confusing"])}']

Right now I have this:

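import sys

def getMin(aList):
    # Smallest index in aList, skipping -1 (what str.find returns on a miss).
    smallest = sys.maxsize  # sys.maxint no longer exists in Python 3
    for item in aList:
        if item != -1 and item < smallest:
            smallest = item
    return smallest

# Find where each kind of opening delimiter first appears in b.
myList = []
myList.append(b.find('['))
myList.append(b.find('{'))
myList.append(b.find('('))
myList.append(b.find('"'))
myMin = getMin(myList)
print(b[:myMin], b[myMin:])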

I'm sure there are better ways to do this, and I'm open to all suggestions.

Chrispresso

3 Answers

You can use regular expressions:

import re
def parse(text):
    m = re.search(r'(.*) ([\[({"].*)', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]

The first part, (.*), is greedy, so it catches everything up to the last space that is followed by one of ( [ { ". The second part catches everything from that opening character to the end of the string.
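As a quick check of that behaviour, here is a small sketch that runs the same pattern against the two strings from the question. The function name `split_last_group` is made up for this example, and the `[` inside the character class is escaped only to avoid the "possible nested set" FutureWarning that newer Python versions emit.

```python
import re

def split_last_group(text):
    # Greedy (.*) backtracks to the LAST space followed by ( [ { or ".
    m = re.search(r'(.*) ([\[({"].*)', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'

print(split_last_group(a))
print(split_last_group(b))
# No opening delimiter after a space, so there is no match at all.
print(split_last_group('no delimiters here'))
```

Note that this version never checks that the opening character is closed again; it only looks for where the last delimited section starts.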

If you need something more robust, this has a more complicated regular expression, but it will make sure that the opening token is matched, and it makes the last expression optional.

def parse(text):
    m = re.search(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$', text)
    if not m:
        return None
    # m.group(2) is None when the string has no trailing delimited section.
    return m.group(1).split(), [m.group(2)]
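Because the trailing section is optional in this pattern, the second group can come back as None. The sketch below is one possible way to handle that, assuming single-line input; the names `TAIL` and `split_optional_tail`, and the choice to return an empty list when nothing matched, are mine rather than part of the answer.

```python
import re

# Same alternation as the stricter pattern: one closed "...", (...), [...] or {...}.
TAIL = re.compile(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$')

def split_optional_tail(text):
    # Every part of the pattern can match the empty string, so search()
    # always finds a match; single-line input is assumed here.
    m = TAIL.search(text)
    head = m.group(1).split()
    # group(2) is None when there is no trailing delimited section.
    tail = [m.group(2)] if m.group(2) is not None else []
    return head, tail

print(split_optional_tail('a b c d e f "something else here"'))
print(split_optional_tail('plain words only'))
# An unclosed quote is not treated as a delimited section.
print(split_optional_tail('x "unclosed'))
```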

Matching vs. Splitting

There is an easy solution. The key is to understand that matching and splitting are two sides of the same coin. When you say "match all", that means "split on what I don't want to match", and vice-versa. Instead of splitting, we're going to match, and you'll end up with the same result.

The Reduced, Simple Version

Let's start with the simplest version of the regex so you don't get scared by something long:

{[^{}]*}|\S+

This matches all the items of your second string, the same as if we were splitting (see demo).

  • The left side of the | alternation matches complete sets of {braces}.
  • The right side of the | matches any run of non-whitespace characters.

It's that simple!
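To see the reduced pattern at work, here is a minimal sketch using `re.findall` on the second string from the question, next to a plain `str.split()` for contrast. The braces are escaped in the pattern for readability; otherwise it is the same expression.

```python
import re

b = 'another parse option {(["gets confusing"])}'

# Match either a complete {...} section or a run of non-whitespace characters.
tokens = re.findall(r'\{[^{}]*\}|\S+', b)
print(tokens)

# Plain splitting breaks the braced section apart at its inner space.
naive = b.split()
print(naive)
```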

The Full Regex

We also need to match "full quotes", (full parentheses) and [full brackets]. No problem: we just add them to the alternation. For clarity, I'm wrapping them in a non-capturing group (?:...) so that the \S+ stands out on its own, but the group isn't required.

(?:{[^{}]*}|"[^"]*"|\([^()]*\)|\[[^][]*\])|\S+

See demo.
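A rough sketch of how the full regex could be used to get the two-part result the question asks for. The helper names `tokenize` and `split_head_tail` are hypothetical, and the sketch assumes the delimited section, if there is one, is the last token. Braces and brackets are escaped in the pattern for readability.

```python
import re

TOKEN = re.compile(r'(?:\{[^{}]*\}|"[^"]*"|\([^()]*\)|\[[^\[\]]*\])|\S+')

def tokenize(text):
    # The group is non-capturing, so findall returns the whole matches.
    return TOKEN.findall(text)

def split_head_tail(text):
    # Assumption: the delimited section, if present, is the last token.
    tokens = tokenize(text)
    if tokens and tokens[-1][0] in '{"([':
        return tokens[:-1], tokens[-1:]
    return tokens, []

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'

print(split_head_tail(a))
print(split_head_tail(b))
print(tokenize('x [1 2] y (3 4)'))
```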

Notes: Potential Improvements

  • We could replace the quoted string regex by one that accepts escaped quotes
  • We could replace the brace, brackets and parentheses expressions by recursive expressions to allow nested constructions, but you'd have to use Matthew Barnett's (awesome) regex module instead of re
  • The technique is related to a simple and beautiful trick to Match (or replace) a pattern except when...
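As an illustration of the first improvement, here is a sketch of a quoted-string alternative that accepts escaped quotes. It assumes a backslash is the escape character, and the pattern is a common idiom rather than something taken from this answer. Only the quoted-string branch is shown, alongside \S+.

```python
import re

# "(?:\\.|[^"\\])*" : a quoted string in which a backslash escapes the next character.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\S+')

text = r'say "he said \"hi\" to me" now'
tokens = TOKEN.findall(text)
print(tokens)
```

The brace, bracket and parenthesis branches from the full regex can be added back to the alternation unchanged.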

Let me know if you have questions!

zx81

Perhaps this link will help:

Split a string by spaces -- preserving quoted substrings -- in Python

It explains how to preserve quoted substrings when splitting a string by spaces.

A A
  • should also work for parentheses of different kinds, brilliant! – m.wasowski Jun 24 '14 at 22:17
  • shlex isn't very useful in this scenario since it is really only good at splitting quotes. I know I can split in the manner I am trying using it, but that requires manipulating `shlex.whitespace`, `shlex.quotes`, etc. and writing a subparser to get the in-between data. – Chrispresso Jun 24 '14 at 22:17
  • @user2599709 if it does not satisfy your needs, please add to your question a use case where it does not work. – m.wasowski Jun 24 '14 at 22:23
  • @m.wasowski Have you tried regex? Is there some reason why that isn't an acceptable approach? – A A Jun 24 '14 at 22:23
  • @m.wasowski if I use shlex and split based off of spaces, that does not meet my requirement that I provided in the original question. Shlex splits on ALL spaces, including ones in between the characters I do not want it to. – Chrispresso Jun 24 '14 at 22:27