This question has been asked and answered many times before. Some examples: [1], [2]. But there doesn't seem to be a more general solution. What I'm looking for is a way to split strings at commas that are not within quotes or pairs of delimiters. For instance:
s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
should be split into a list of three elements
['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']
The problem is that this can get more complicated, since pairs of <> and () can be nested:
s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
which should be split into:
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
The naive solution without using regex is to parse the string by looking for the characters ,<(. If either < or ( is found, then we start counting the parity (in effect, the nesting depth). We can only split at a comma if the parity is zero. For instance, say we want to split s2: we start with parity = 0, and when we reach s2[3] we encounter <, which increases the parity by 1. The parity only decreases when we encounter > or ), and it increases when we encounter < or (. While the parity is not 0 we simply ignore the commas and do not split.
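To make the counting approach concrete, here is a minimal sketch of it. The function name split_top_level is made up for illustration, the delimiters are hard-coded to <> and (), and quotes are deliberately not handled, so a comma inside a quoted string is still treated as a split point.

```python
def split_top_level(text):
    """Split text at commas that are not inside <...> or (...).

    Illustrative sketch only: quotes are not handled and the
    delimiter pairs are hard-coded.
    """
    parts = []
    current = []
    depth = 0  # the "parity" counter from the description above
    for ch in text:
        if ch in '<(':
            depth += 1
        elif ch in '>)':
            depth -= 1
        if ch == ',' and depth == 0:
            # Top-level comma: close the current piece and start a new one.
            parts.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current).strip())
    return parts
```

On the nested part of s2 this gives the desired pieces, but the quoted message is still cut at its comma, which is why the quotes need separate treatment.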
The question here is: is there a way to do this quickly with regex? I was looking into this solution, but it doesn't seem to cover the examples I have given.
A more general function would be something like this:
def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""
Some uses would be like this:
split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')])
Would regex be able to handle this or is it necessary to create a specialized parser?
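For comparison, this is roughly what I imagine the hand-written (non-regex) version of split_at would look like. It is only a sketch under some assumptions of mine: the delimiter and every exception marker is a single character, a pair whose opener equals its closer (such as '"') is treated as a quote, there is no escape handling inside quotes, and unbalanced closers are treated as ordinary characters.

```python
def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions (sketch: single-character markers only)."""
    # Pairs with distinct open/close characters can nest.
    openers = {o: c for o, c in exceptions if o != c}
    # Pairs with identical open/close characters behave like quotes.
    quotes = {o for o, c in exceptions if o == c}

    parts, current = [], []
    stack = []    # closing characters we are still waiting for
    quote = None  # the quote character we are inside, if any

    for ch in text:
        if quote is not None:
            # Inside quotes everything is literal until the closing quote.
            if ch == quote:
                quote = None
        elif ch in quotes:
            quote = ch
        elif ch in openers:
            stack.append(openers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
        elif ch == delimiter and not stack:
            # Delimiter outside every exception: split here.
            parts.append(''.join(current).strip())
            current = []
            continue
        current.append(ch)

    parts.append(''.join(current).strip())
    return parts
```

With the exceptions list from the usage above, this returns the three-element lists expected for both s1 and s2. It uses a stack rather than a single counter so that a closer only matches the most recent opener.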