6

I am parsing a file that has lines such as

type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")

And I want to split this into separate fields.

In my example, there are four fields: type, title, pages, and comments.

The desired result after splitting is

['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")']

A simple string split won't work, because it will split at every space. I want to split on spaces but preserve anything between parentheses and quotation marks.
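To illustrate the problem, here is a minimal sketch of what a plain `str.split()` does to the sample line (the variable names `line` and `naive` are only for illustration):

```python
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

# str.split() with no arguments breaks at every run of whitespace,
# including the spaces inside the quotes and the parentheses.
naive = line.split()

for piece in naive:
    print(piece)
```

The quoted title, the page ranges, and the comment each end up scattered over several pieces instead of staying in one field.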

How can I split this?

martineau
MxLDevs

5 Answers

16

This regex should work for you: \s+(?=[^()]*(?:\(|$))

import re
result = re.split(r"\s+(?=[^()]*(?:\(|$))", subject)  # subject is the line to split
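As a quick check, the pattern above can be run against the sample line from the question. This is only a sketch: the name `FIELD_SEP` is made up for illustration, and the trailing newline is an assumption about how lines arrive when read from a file.

```python
import re

# Whitespace that is followed by no parentheses until the next "("
# or the end of the string, i.e. whitespace outside any (...) group.
FIELD_SEP = re.compile(r"\s+(?=[^()]*(?:\(|$))")

# Assumed input: one line as read from a file, newline included.
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")\n'

# Strip first: a trailing newline is whitespace followed by the end of
# the string, so it also matches and leaves an empty final field.
fields = FIELD_SEP.split(line.strip())
unstripped = FIELD_SEP.split(line)

print(fields)
print(unstripped)
```

Without the `strip()`, the split still finds the four fields but appends an empty string after them.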

Explanation

r"""
\s             # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=            # Assert that the regex below can be matched, starting at this position (positive lookahead)
   [^()]          # Match a single character NOT present in the list “()”
      *              # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:              # Match the regular expression below
                     # Match either the regular expression below (attempting the next alternative only if this one fails)
         \(             # Match the character “(” literally
      |              # Or match regular expression number 2 below (the entire group fails if this one fails to match)
         $              # Assert position at the end of a line (at the end of the string or before a line break character)
   )
)
"""
Narendra Yadala
  • Nice, although it seems to be adding some extra parentheses in the returned list (I'm not sure where they're coming from either). I'm using py3. – MxLDevs Mar 10 '12 at 07:48
  • Try this: `re.split(r"\s+(?=[^()]*(?:\(|$))", subject)` – San4ez Mar 10 '12 at 07:50
  • @Keikoku fixed it. It is because of the capturing group. – Narendra Yadala Mar 10 '12 at 07:51
  • How would you extend this to support both round () and square [] brackets? I.e., ignore all strings in between any (well-matched) pair of such brackets? – gen Jul 18 '18 at 01:40
3

Split on ") " and add a ) back to each element except the last.
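A minimal sketch of that idea, assuming the sequence `") "` only ever occurs between fields (it would misfire if a quoted string contained a closing parenthesis followed by a space):

```python
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

# Splitting on ") " consumes the closing parenthesis of every field
# except the last one, so put it back on all but the final piece.
pieces = line.split(") ")
fields = [piece + ")" for piece in pieces[:-1]] + pieces[-1:]

print(fields)
```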

Karl Knechtel
1

Let me add a non-regex solution:

line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

count = 0 # Bracket counter
last_break = 0 # Index of the last break
parts = []
for j, char in enumerate(line):
    # Compare with ==, not "is": identity checks on str/int literals are unreliable
    if char == '(': count += 1
    elif char == ')': count -= 1
    elif char == ' ' and count == 0:
        parts.append(line[last_break:j])
        last_break = j + 1
parts.append(line[last_break:]) # Add last element
parts = tuple(p for p in parts if p) # Convert to tuple and remove empty

for p in parts:
    print(p)

In general, there are things you cannot do with regular expressions (Python's re module, for instance, cannot match arbitrarily nested parentheses), and there can be serious performance penalties, especially for lookahead and lookbehind, so a regex is not always the best solution for a given problem.

Also, I thought I'd mention the pyparsing module, which can be used to create custom text parsers.
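To make the limitation concrete, here is a sketch comparing the lookahead regex from the other answer with a depth counter on input containing nested parentheses. The nested sample line and the function name `split_top_level` are invented for illustration and are not part of the question.

```python
import re

def split_top_level(line):
    """Split on spaces that sit outside every pair of parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(line):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ' ' and depth == 0:
            if i > start:              # skip empty pieces from repeated spaces
                parts.append(line[start:i])
            start = i + 1
    if start < len(line):              # last field
        parts.append(line[start:])
    return parts

# Hypothetical input with one level of nesting inside pages(...).
nested = 'pages(10-35 (see also 70) 200-234) comments("good read")'

regex_parts = re.split(r"\s+(?=[^()]*(?:\(|$))", nested)
counter_parts = split_top_level(nested)

print(regex_parts)    # splits inside pages(...), before the inner "("
print(counter_parts)  # keeps pages(...) in one piece
```

The regex only looks ahead for the next parenthesis, so it cannot tell an inner "(" from the start of a new field; the counter tracks the actual nesting depth.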

MarcinKonowalczyk
  • It's been 8 years since I initially asked the question, but I would agree: using a parser is better than regex, especially for things like parentheses and quotation matching. – MxLDevs Sep 21 '20 at 20:46
1

I would try using a positive look-behind assertion.

r'(?<=\))\s+'

Example:

>>> import re
>>> result = re.split(r'(?<=\))\s+', 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")')
>>> result
['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")']
yas
0

Here's another non-regex solution that splits a string on spaces, except where a substring is between parentheses.

    file_line = 'type("book") title("golden apples") pages(10 - 35 70 200 - 234) comments("good read")'
    list_of_params = []
    param = ''

    between_parenthesis = False
    for character in file_line:
        if between_parenthesis:
            if character == ')':
                between_parenthesis = False
        else:
            if character == '(':
                between_parenthesis = True

            if character == ' ':
                list_of_params.append(param)
                param = ''
                continue

        param += character

    list_of_params.append(param)
    print(list_of_params)

result:

    ['type("book")', 'title("golden apples")', 'pages(10 - 35 70 200 - 234)', 'comments("good read")']
ryonthli