7

I am trying to split a string into a list by a delimiter (let's say a comma), but the delimiter should count only when it is not enclosed in a certain pair of characters, in my particular case <>. In other words, a comma nested inside <> is ignored as a delimiter and treated as a regular character.

So if I have the following string:

"first token, <second token part 1, second token part 2>, third token"

it should split into

list[0] = "first token"
list[1] = "second token part 1, second token part 2"
list[2] = "third token"

Needless to say, I cannot just do a simple split by , because that will split the second token into two tokens, second token part 1 and second token part 2, as they have a comma in between them.

How should I define the pattern to do it using Python RegEx?

amphibient

2 Answers

10

Update: Since you mentioned that the brackets may be nested, I regret to inform you that a regex solution is not possible with Python's built-in re module. The following works only if the angle brackets are always balanced and never nested or escaped:

>>> import re
>>> s = "first token, <second token part 1, second token part 2>, third token"
>>> regex = re.compile(",(?![^<>]*>)")
>>> regex.split(s)
['first token', ' <second token part 1, second token part 2>', ' third token']
>>> [item.strip(" <>") for item in _]
['first token', 'second token part 1, second token part 2', 'third token']

The regex ,(?![^<>]*>) splits on commas only if the next angle bracket that follows isn't a closing angle bracket.
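To make the lookahead easier to follow, here is the same pattern written in verbose mode with each part annotated. This is only a sketch: the names COMMA_OUTSIDE_ANGLES and split_outside_angles are made up for illustration, and it still assumes balanced, non-nested brackets.

```python
import re

# Same pattern as ,(?![^<>]*>) but spelled out in verbose mode.
# COMMA_OUTSIDE_ANGLES is a hypothetical name, not part of any library.
COMMA_OUTSIDE_ANGLES = re.compile(
    r"""
    ,            # a literal comma ...
    (?!          # ... that is NOT followed by
        [^<>]*   #     a run of non-bracket characters
        >        #     and then a closing angle bracket
    )
    """,
    re.VERBOSE,
)


def split_outside_angles(text):
    """Split on commas outside <...>, then trim spaces and angle brackets."""
    return [part.strip(" <>") for part in COMMA_OUTSIDE_ANGLES.split(text)]


s = "first token, <second token part 1, second token part 2>, third token"
print(split_outside_angles(s))
# ['first token', 'second token part 1, second token part 2', 'third token']
```

Note that strip(" <>") removes any leading or trailing spaces and angle brackets, so it would also eat a bracket that legitimately belongs at the edge of a token.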

Nested brackets preclude this or any other regex solution from working with Python's built-in re module. You either need a regex engine that can match nested constructs (Perl and PCRE support recursive patterns; .NET offers balancing groups instead), or you need to use a parser.
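For the parser route, a small hand-written scanner that tracks bracket depth is probably enough. The following is a minimal sketch, not a tested library routine: split_top_level and its parameters are hypothetical names, and it assumes single-character delimiters and brackets with no escaping.

```python
def split_top_level(text, delimiter=",", opener="<", closer=">"):
    """Split text on delimiter, ignoring delimiters inside opener/closer pairs.

    Hypothetical helper: assumes single-character markers and no escaping.
    """
    parts = []
    current = []
    depth = 0  # how many brackets are currently open
    for ch in text:
        if ch == opener:
            depth += 1
        elif ch == closer:
            if depth == 0:
                raise ValueError("unbalanced closing bracket")
            depth -= 1
        if ch == delimiter and depth == 0:
            # A delimiter at depth 0 ends the current token.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced opening bracket")
    parts.append("".join(current).strip())
    return parts


print(split_top_level("a, <b, <c, d>, e>, f"))
# ['a', '<b, <c, d>, e>', 'f']
```

Unlike the regex version, this keeps the outer brackets on each token, so the caller can decide whether and how to remove them.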

Tim Pietzcker
  • The regex matches the commas in 'first token `,` > `,` third token'. –  Nov 21 '13 at 18:36
  • @sln: Yes, that's what I wrote. No regex solution can handle nested tags (if the nesting is arbitrary). Unfortunately, the information about nesting being possible only came after I had written my answer. – Tim Pietzcker Nov 21 '13 at 18:37
  • Yes, no Python regex can. As you say, others can and it's really trivial. –  Nov 21 '13 at 18:40
  • verified. passes functionality **AND** elegance :) – amphibient Nov 21 '13 at 18:47
6

One kludgy way that works for your example is to translate the <>'s into "'s and then treat it as a CSV file:

import csv

s = "first token, <second token part 1, second token part 2>, third token"
# Python 3: str.maketrans replaces Python 2's string.maketrans.
# Map both angle brackets to double quotes so csv sees a quoted field.
a = s.translate(str.maketrans('<>', '""'))
# first token, "second token part 1, second token part 2", third token
print(next(csv.reader([a], skipinitialspace=True)))
# ['first token', 'second token part 1, second token part 2', 'third token']
Jon Clements