4

I would like to turn this:

mystr = '  foo1   (foo2 foo3 (foo4))' 

into:

['foo1','foo2 foo3 (foo4)']

So basically I have to split on any number of spaces/tabs and on the parentheses.

I have seen that the `re` package's `split` function can handle several delimiters (Python: Split string with multiple delimiters), but I cannot work out the right approach for parsing this kind of string.

What would be the best (most Pythonic) and simplest approach?

OneCricketeer
M.E.
  • Judging by this answer, `pyparsing` may be the way to go: https://stackoverflow.com/a/14059322/3019689 – jez Aug 21 '17 at 00:18
  • You want to split only at the first occurrence of a space? – ABcDexter Aug 21 '17 at 00:19
  • @ABcDexter yes I do, I want to get the first "word" and everything between the ( ), which might contain ()'s too which shall be considered as inner character same as any other. – M.E. Aug 21 '17 at 00:22
  • Will the string always be of the form `word (...)`? Or will there sometimes be stuff after the parentheses? – Izaak van Dongen Aug 21 '17 at 00:24
  • the string will be always consistent with format `word (....)` – M.E. Aug 21 '17 at 00:25
  • I think a bigger question is, will there ever be multiple non-nested sets of parentheses on a single input line? That's the situation that regex can't handle properly if it's able to handle the nested case. E.g. do you ever need to parse something like `foo (bar (baz)) (quux)`? I don't think a regex can get both the `baz` and `quux` parts right at the same time. – Blckknght Aug 21 '17 at 00:29
  • @Blckknght no there won't be such structures. It will be strictly `word (....)` but inside the parentheses there might be any character including parentheses. – M.E. Aug 21 '17 at 00:31

3 Answers

5

As far as I can understand, this is consistent with what you want, and is pretty simple. It just uses some slicing to isolate the first word and the part between parentheses. It also has to use strip a couple of times due to the extra spaces. It may seem a little verbose, but to be honest if the task can be accomplished with such simple string operations I feel like complicated parsing is unnecessary (although I may have gotten it wrong). Note that this is flexible in the amount of whitespace to split by.

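mystr = '  foo1   (foo2 foo3 (foo4))'
mystr = mystr.strip()          # drop leading/trailing whitespace
i = mystr.index(' ')           # position of the first space after the word
a = mystr[:i].strip()          # the first word
b = mystr[i:].strip()[1:-1]    # the rest, minus the outer parentheses
print([a, b])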

with output

['foo1', 'foo2 foo3 (foo4)']

Although I'm still not entirely clear if this is what you want. Let me know if it works or what needs changing.
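For the tabs-versus-spaces case, one possible variation is to let `str.split` do the whitespace handling instead of `index(' ')`. This is only a sketch: the function name `parse_word_and_group` is made up for illustration, and it assumes the input always has the `word (...)` shape.

```python
def parse_word_and_group(text):
    """Split 'word (...)' into [word, contents of the outer parentheses]."""
    # With no separator given, split() treats any run of spaces or tabs as a
    # single delimiter; maxsplit=1 leaves everything after the word intact.
    # strip() first, because trailing whitespace would otherwise stay on the
    # second piece.
    word, rest = text.strip().split(maxsplit=1)
    # Assumption: the remainder is wrapped in one outer pair of parentheses.
    if not (rest.startswith('(') and rest.endswith(')')):
        raise ValueError('expected input of the form: word (...)')
    return [word, rest[1:-1]]


print(parse_word_and_group('  foo1   (foo2 foo3 (foo4))'))
print(parse_word_and_group('\tfoo1 \t (a (b))\t'))
```

The explicit check means malformed input fails loudly rather than silently losing its first and last characters.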

Izaak van Dongen
  • Thanks, this is exactly what I was looking for. As a side question, would it be complex to be able to deal with tabs same as spaces? so instead of any number of spaces, it could deal with any number of spaces/tabs. – M.E. Aug 21 '17 at 00:36
  • 1
    To be honest, at that point I'd just do `mystr = mystr.replace("\t", " ")` and then use the old code from there. It might not be the best approach but it's simple. – Izaak van Dongen Aug 21 '17 at 00:37
  • That is simple enough and easy to follow/understand. Thanks – M.E. Aug 21 '17 at 00:38
3

If the structure of your string is as rigidly defined as you say, you can use a regular expression to parse it pretty easily:

import re

mystr = '  foo1   (foo2 foo3 (foo4))'

pattern = r'(\S+)\s+\((.*)\)'
match = re.search(pattern, mystr)
results = match.groups() # ('foo1', 'foo2 foo3 (foo4)')

Be careful with this approach, though, if your real input is not as well defined as you have suggested in your question. Regular expressions can only parse regular languages, and the way parentheses usually work is not "regular". In this question you only care about handling a single set of parentheses (the outermost), so a simple greedy match works. It might be hard or impossible to adapt this solution to other input formats, even if they seem very similar!
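To make the caveat concrete, here is a rough sketch comparing the greedy pattern with a small depth counter on the `foo (bar (baz)) (quux)` example from the comments. The helper names `greedy_parse` and `balanced_groups` are hypothetical, and the counter is just one simple way to track nesting, not the only one.

```python
import re

PATTERN = re.compile(r'(\S+)\s+\((.*)\)')


def greedy_parse(text):
    """Apply the greedy pattern; return the groups, or None if no match."""
    match = PATTERN.search(text)
    return match.groups() if match else None


def balanced_groups(text):
    """Return the contents of each top-level parenthesised group."""
    groups, depth, start = [], 0, None
    for pos, char in enumerate(text):
        if char == '(':
            if depth == 0:
                start = pos + 1   # first character inside a top-level group
            depth += 1
        elif char == ')':
            depth -= 1
            if depth < 0:
                raise ValueError('unbalanced closing parenthesis')
            if depth == 0:
                groups.append(text[start:pos])
    if depth != 0:
        raise ValueError('unbalanced opening parenthesis')
    return groups


text = 'foo (bar (baz)) (quux)'
print(greedy_parse(text))     # ('foo', 'bar (baz)) (quux')
print(balanced_groups(text))  # ['bar (baz)', 'quux']
```

The greedy `.*` runs to the last `)` on the line, so the two sibling groups get merged into one capture; the counter keeps them apart because it only closes a group when the depth returns to zero.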

Blckknght
0
[mystr.split('   ')[0].strip(),mystr.split('   ')[1][1:-1]]

A simple one-liner. Output:

['foo1', 'foo2 foo3 (foo4)']
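Note that `split('   ')` only matches a separator of exactly three spaces, as in the sample string. If the amount of whitespace can vary, a possible variant (a sketch, with the hypothetical name `split_word_and_parens`) is to split once on any whitespace run with `re.split`:

```python
import re


def split_word_and_parens(text):
    # Split once on the first run of whitespace (spaces or tabs); maxsplit=1
    # keeps the spaces inside the parentheses untouched.
    word, rest = re.split(r'\s+', text.strip(), maxsplit=1)
    # Assumes rest looks like '(...)', so drop the outer parentheses.
    return [word, rest[1:-1]]


print(split_word_and_parens('  foo1   (foo2 foo3 (foo4))'))
print(split_word_and_parens('foo1\t(foo2 foo3 (foo4))'))
```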
whackamadoodle3000