9

This question has been asked and answered many times before. Some examples: [1], [2]. But there doesn't seem to be a more general solution. What I'm looking for is a way to split strings at commas that are not within quotes or pairs of delimiters. For instance:

s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'

should be split into a list of three elements

['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']

This gets more complicated because pairs of <> and () can themselves be nested:

s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'

which should be split into:

['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']

The naive solution without regex is to scan the string for the characters , < ( ) and >, keeping a nesting counter (call it parity). The parity increases by 1 whenever we encounter < or ( and decreases by 1 whenever we encounter > or ). We may only split at a comma when the parity is zero. For instance, to split s2 we start with parity = 0; at s2[3] we encounter <, which raises the parity to 1. While the parity is not 0 we simply ignore commas and do no splitting.
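
A minimal sketch of that counting approach, for illustration only. The name `split_top_level` is made up here, and quotes are deliberately not handled, exactly as in the description above:

```python
def split_top_level(text, opens='<(', closes='>)'):
    """Split text at commas that are not inside <> or () pairs."""
    parts = []
    start = 0   # index where the current field begins
    depth = 0   # the "parity" counter: number of unclosed brackets

    for i, ch in enumerate(text):
        if ch in opens:
            depth += 1
        elif ch in closes:
            depth -= 1
        elif ch == ',' and depth == 0:
            # Top-level comma: cut the field here.
            parts.append(text[start:i].strip())
            start = i + 1

    # Whatever follows the last top-level comma is the final field.
    parts.append(text[start:].strip())
    return parts


print(split_top_level('obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5)'))
# ['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)']
```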

The question here is: is there a way to do this quickly with regex? I looked into this solution, but it doesn't seem to cover the examples I have given.

A more general function would be something like this:

def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""

Some uses would be like this:

split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')])

Would regex be able to handle this or is it necessary to create a specialized parser?

jmlopez
  • Regular expressions will not help you in this case since the language (i.e. group of strings) you are trying to parse is not regular. Given that you allow for arbitrary nesting of tags, there is no easy way to regex your way out of this. – Yuval Adam Dec 15 '13 at 20:15
  • Regex cannot in fact handle this, and you wouldn't want it to. The complexity is linear at a minimum, so you would necessarily always get better performance with the parity checker. You don't have to build it yourself though. Python's `csv` module does a lot of the legwork. – Slater Victoroff Dec 15 '13 at 20:17
  • Argh, don't say that regex can't handle it ! Maybe the python flavor couldn't, but other flavors like PCRE could do it ! This is [a proof](http://regex101.com/r/wU7lC9), we might even get fancy and use recursive patterns to take into account nested `<>()` – HamZa Dec 15 '13 at 20:48
  • This is also possible with regex if you know the maximum nested recursion depth of bracketed elements. But in Python where you don't have [recursive regex support](http://www.regular-expressions.info/recurse.html) go with a more maintainable parser function. – Dean Taylor Dec 15 '13 at 21:06
  • Aaaand [I did it](http://regex101.com/r/gZ7nL0), now the question is why did I O_o ? – HamZa Dec 15 '13 at 21:07
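
To illustrate the comment about a known maximum nesting depth: a rough sketch using only Python's standard `re` module, which has no recursion, so the bracket pattern is simply expanded a fixed number of times. The names `build_splitter` and `regex_split` and the `max_depth` parameter are assumptions made for this example, not something from the linked regex101 demos:

```python
import re


def build_splitter(max_depth=3):
    """Compile a pattern that matches one top-level field.

    Assumes <> and () nest at most max_depth levels deep and that
    double-quoted strings contain no escaped quotes.
    """
    # Contents of an innermost bracket pair: no brackets at all.
    body = r'[^<>()]*'
    # Each pass allows one more level of nesting inside the body.
    for _ in range(max_depth - 1):
        body = r'(?:[^<>()]|<' + body + r'>|\(' + body + r'\))*'
    # A field is a run of quoted strings, bracket groups, or plain
    # characters other than commas, brackets, and quotes.
    field = r'(?:"[^"]*"|<' + body + r'>|\(' + body + r'\)|[^,<>()"])+'
    return re.compile(field)


def regex_split(text, max_depth=3):
    return [m.strip() for m in build_splitter(max_depth).findall(text)]


s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
print(regex_split(s2))
# ['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
```

The pattern grows quickly with `max_depth`, and empty fields (two adjacent commas) are dropped rather than returned, so this is more of a curiosity than a replacement for a small parser.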

3 Answers

8

Python's built-in re module has no recursive patterns, so a regular expression cannot handle arbitrary nesting here, but the following simple code will achieve the desired result:

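def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0      # current bracket nesting depth
    quote = None   # the quote character we are inside, if any

    for char in text:
        if char in delimiter and level == 0 and quote is None:
            # Top-level delimiter: finish the current field.
            result.append(buff)
            buff = ""
            continue

        buff += char
        if quote is not None:
            # Inside quotes only the matching quote character is special.
            if char == quote:
                quote = None
        elif char in quotes:
            quote = char
        elif char in opens:
            level += 1
        elif char in closes:
            level -= 1

    if buff != "":
        result.append(buff)

    return result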

Running this in the interpreter:

>>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
Aaron Cronin
  • `if char in closes: level -= 1 continue if char in opens:` That should let you add delimiters that both open and close, like the literal quote, so `"msg, with comma"` passes. No need for a separate handler for this case. – kalhartt Dec 15 '13 at 20:33
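
For the `exceptions` signature proposed in the question, one possible sketch keeps a stack of the closing characters still expected. Pairs whose opener and closer are the same character (such as quotes) then need no separate flag. The name `split_at_exceptions` is hypothetical, chosen here only to avoid clashing with `split_at` above:

```python
def split_at_exceptions(text, delimiter, exceptions):
    """Split text at delimiter unless it lies inside one of the
    (opener, closer) pairs listed in exceptions."""
    closer_for = dict(exceptions)
    # Pairs such as ('"', '"') act like quotes: nothing nests inside them.
    symmetric = {o for o, c in exceptions if o == c}
    stack = []    # closers we are still waiting for, innermost last
    fields = []
    buff = []

    for ch in text:
        in_quote = bool(stack) and stack[-1] in symmetric
        if stack and ch == stack[-1]:
            stack.pop()                     # innermost pair is closed
        elif not in_quote and ch in closer_for:
            stack.append(closer_for[ch])    # a new pair opens
        elif ch == delimiter and not stack:
            fields.append(''.join(buff))    # top-level delimiter
            buff = []
            continue
        buff.append(ch)

    fields.append(''.join(buff))
    return fields


pairs = [('<', '>'), ('(', ')'), ('"', '"')]
print(split_at_exceptions('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', pairs))
# ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
```

Like `str.split`, this keeps leading spaces and empty fields; a stray closer with nothing open is treated as an ordinary character.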
5

Using iterators and generators:

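from collections import defaultdict
from operator import length_hint

def tokenize(txt, delim=',', pairs={'"':'"', '<':'>', '(':')'}):
    fst, snd = set(pairs.keys()), set(pairs.values())
    it = iter(txt)

    def loop():
        # cnt[closer] counts the pairs still waiting for that closer
        cnt = defaultdict(int)

        # iterate with for: a StopIteration escaping a generator
        # becomes a RuntimeError on Python 3.7+
        for ch in it:
            if ch == delim and not any(cnt[x] for x in snd):
                return
            elif ch in snd and cnt[ch] > 0:
                # closer of an open pair; checked before openers so a
                # closing quote is not mistaken for a new opening quote
                cnt[ch] -= 1
            elif ch in fst:
                cnt[pairs[ch]] += 1
            yield ch

    # keep producing tokens while unread characters remain
    while length_hint(it):
        yield ''.join(loop())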

and,

>>> txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"'
>>> [x for x in tokenize(txt)]
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
behzad.nouri
4

If you have recursively nested expressions, you can split on the commas and also validate that the delimiters match by using pyparsing:

import pyparsing as pp

def CommaSplit(txt):
    ''' Replicate the function of str.split(',') but do not split on nested expressions or in quoted strings'''
    com_lok=[]
    comma = pp.Suppress(',')
    # note the location of each comma outside an ignored expression:
    comma.setParseAction(lambda s, lok, toks: com_lok.append(lok))
    ident = pp.Word(pp.alphas+"_", pp.alphanums+"_")  # python identifier
    ex1=(ident+pp.nestedExpr(opener='<', closer='>'))   # Ignore everything inside nested '< >'
    ex2=(ident+pp.nestedExpr())                       # Ignore everything inside nested '( )'
    ex3=pp.Regex(r'("|\').*?\1')                      # Ignore everything inside "'" or '"'
    atom = ex1 | ex2 | ex3 | comma
    expr = pp.OneOrMore(atom) + pp.ZeroOrMore(comma  + atom )
    try:
        result=expr.parseString(txt)
    except pp.ParseException:
        return [txt]
    else:    
        return [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]             


tests='''\
obj<1, 2, 3>, x(4, 5), "msg, with comma"
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma"
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3>
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5),  , 'msg, with comma', obj<1, sub<6, 7>, 3>
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)
'''

for te in tests.splitlines():
    result=CommaSplit(te)
    print(te,'==>\n\t',result)

Prints:

obj<1, 2, 3>, x(4, 5), "msg, with comma" ==>
     ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma" ==>
     ['nesteobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', ' "msg, with comma"']
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3> ==>
     ['nestedobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', " 'msg, with comma'", ' additional<1, sub<6, 7>, 3>']
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5),  , 'msg, with comma', obj<1, sub<6, 7>, 3> ==>
     ['bare_comma<1, sub(6, 7), 3>', ' x(4, y(8, 9), 5)', '  ', " 'msg, with comma'", ' obj<1, sub<6, 7>, 3>']
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3) ==>
     ["bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)"]

The current behavior is just like '(something does not split), b, "in quotes", c'.split(','), including keeping the leading spaces and the quotes. It is trivial to strip the quotes and leading spaces from the fields.

Change the else under try to the following (with strip_fields added as a boolean parameter of CommaSplit):

else:
    rtr = [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]
    if strip_fields:
        rtr=[e.strip().strip('\'"') for e in rtr]
    return rtr  
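
The stripping step can be checked on its own, without pyparsing. The `fields` list below is simply the first result from the output above, reused as sample input:

```python
# Sample input: the fields produced for the first test line above.
fields = ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']

# Strip surrounding whitespace first, then any surrounding quote characters.
cleaned = [e.strip().strip('\'"') for e in fields]

print(cleaned)
# ['obj<1, 2, 3>', 'x(4, 5)', 'msg, with comma']
```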
dawg
  • Downside to this approach is you then have to build conditionals to re-stitch the items that weren't supposed to be split. – brandonscript Dec 15 '13 at 20:31
  • This is not correct since it splits the string `"obj<1, 2, 3>"`. – jmlopez Dec 15 '13 at 20:32
  • I agree that libraries are the sensible solution, but this does not answer the question correctly. – Aaron Cronin Dec 15 '13 at 20:39
  • May want to consider another fix since the following does not work: `result=expr.parseString('obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"')` – jmlopez Dec 16 '13 at 04:41
  • @jmlopez: OK -- I fixed it again and learned a bit of pyparsing in the process. That is a very good question! – dawg Dec 20 '13 at 01:33