4

I am given a string which is of this pattern:

[blah blah blah] [more blah] some text

I want to split the string into three parts: blah blah blah, more blah and some text.

A crude way to do it is to use mystr.split('] ') and then remove the leading [ from the first two elements. Is there a better, more performant way? I need to do this for thousands of strings very quickly.

skyork
  • is the first set of `[blah blah blah]` and the second set `[more blah]` always going to contain the same amount of characters? – TehTris May 22 '13 at 21:21
  • @TehTris, not really. They will contain content of varied lengths. – skyork May 22 '13 at 22:32
  • Then `re` is probably going to be your best bet, unless you want to do a bunch of silly stuff like `first = line[:line.find(']')]` `second = line[len(first):line.find(']')]` `third = line[len(first)+len(second):]` – TehTris May 22 '13 at 22:49
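
A working non-regex variant of the slicing idea from the last comment might look like the sketch below. It uses `str.partition` instead of manual index arithmetic. The helper name `split_bracketed` is hypothetical, and the sketch assumes both bracketed blocks are always present and contain no `]` of their own.

```python
def split_bracketed(line):
    """Split '[a] [b] rest' into ('a', 'b', 'rest') without a regex.

    Hypothetical helper: assumes both bracketed blocks exist and that
    neither block contains a ']' character.
    """
    # Drop the leading '[' and cut at the first ']'.
    first, _, rest = line.lstrip()[1:].partition(']')
    # Skip the whitespace and the second '[', then cut at the next ']'.
    second, _, rest = rest.lstrip()[1:].partition(']')
    # Whatever is left (minus leading whitespace) is the free text.
    return first, second, rest.lstrip()


print(split_bracketed('[blah blah blah] [more blah] some text'))
# ('blah blah blah', 'more blah', 'some text')
```

Whether this beats a precompiled regex for thousands of strings is something to measure with `timeit` on real data rather than assume.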

2 Answers

5

You can use a regular expression to extract the text, if you know that it will be in that form. For efficiency, you can precompile the regex and then repeatedly use it when matching.

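import re

# Compile once, then reuse the compiled pattern for every string.
prog = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')

for mystr in string_list:
    result = prog.match(mystr)
    # groups() is (first block, second block, remaining text)
    groups = result.groups()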

If you'd like an explanation on the regex itself, you can get one using this tool.
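
If some input lines might not follow the pattern, `match` returns `None` and calling `.groups()` on it raises `AttributeError`. A minimal sketch that guards against this, using the string from the question as sample input (the names `PATTERN` and `parse_line` are illustrative only):

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = re.compile(r'\[([^\]]*)\]\s*\[([^\]]*)\]\s*(.*)')


def parse_line(line):
    """Return (first, second, text), or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.groups()


print(parse_line('[blah blah blah] [more blah] some text'))
# ('blah blah blah', 'more blah', 'some text')
print(parse_line('no brackets here'))
# None
```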

voithos
  • thank you for your answer. I wonder if it is possible to use regex to match the similar situation where the second `[more blah]` block may or may not exist. In other words, can we use a regex to split strings which are either `[blah blah] [more blah] some text` or `[blah blah] some text`? – skyork May 23 '13 at 18:12
  • 1
    @skyork: Yep, just add an 'optional' (`?`) modifier to a non-capturing group `(?: ... )` which encloses the second set of `[]`. In other words, this: `\[([^\]]*)\]\s*(?:\[([^\]]*)\])?\s*(.*)` – voithos May 23 '13 at 18:53
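
To illustrate the optional-group pattern from the comment above, here is a short sketch run on both input shapes mentioned in the follow-up question. The names `OPTIONAL` and `split_optional` are made up for the example. When the second block is absent, its group comes back as `None`.

```python
import re

# Pattern from the comment: the second [..] block sits inside an
# optional non-capturing group.
OPTIONAL = re.compile(r'\[([^\]]*)\]\s*(?:\[([^\]]*)\])?\s*(.*)')


def split_optional(line):
    """Return (first, second_or_None, text) for a matching line."""
    return OPTIONAL.match(line).groups()


print(split_optional('[blah blah] [more blah] some text'))
# ('blah blah', 'more blah', 'some text')
print(split_optional('[blah blah] some text'))
# ('blah blah', None, 'some text')
```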
1

You can also split on the brackets themselves with `re.split`, so that the delimiter characters are dropped from the result:

>>> import re
>>> s = '[...] [...] ...'
>>> re.split(r'\[|\] *\[?', s)[1:]
['...', '...', '...']
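
One caveat with this approach: a `[` or `]` in the trailing text would be split on as well. A possible refinement, sketched here as an assumption rather than as part of the original answer, is to precompile the pattern and cap the number of splits with `maxsplit=3`, which leaves the remainder of the string untouched (`SPLITTER` and `split_parts` are illustrative names):

```python
import re

# Same split pattern as above, compiled once for reuse.
SPLITTER = re.compile(r'\[|\] *\[?')


def split_parts(line):
    """Split '[a] [b] rest' into ['a', 'b', 'rest'].

    maxsplit=3 stops after the three bracket delimiters, so brackets
    in the trailing text are left alone.
    """
    # The first element is the empty string before the leading '['.
    return SPLITTER.split(line, maxsplit=3)[1:]


print(split_parts('[blah blah blah] [more blah] some text'))
# ['blah blah blah', 'more blah', 'some text']
print(split_parts('[a] [b] c [d]'))
# ['a', 'b', 'c [d]']
```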
pvoosten