The basic tool for processing strings is the regular expression module (re).
Given the information you provided (which may be incomplete), the following code does the job:
import re

# Two alternatives: the text that precedes a bracket, or a bracketed group.
r = re.compile(r'(?! )[^[]+?(?= *\[)'
               '|'
               r'\[.+?\]')
s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
print(r.findall(s1))
print('---------------')
s2 = "'zug hug'Quantity boondoggle 'fish face monkey "\
     "dung' [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
print(r.findall(s2))
result
['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
---------------
["'zug hug'Quantity boondoggle 'fish face monkey dung'", "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
The regular expression pattern should be read as follows.
'|'
means OR
So the pattern is an alternation of two sub-patterns:
(?! )[^[]+?(?= *\[)
and
\[.+?\]
The first sub-pattern:
The core is [^[]+
Brackets define a set of characters. When the symbol ^ comes right after the opening bracket [, the set is negated: it contains every character except the ones listed after the ^.
Here, [^[] means any character that isn't an opening bracket [ and, since a + follows the set, [^[]+ means a sequence of one or more characters, none of which is an opening bracket.
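To see what the bare set does on its own, here is a small sketch (the variable names are mine, the sample string is the one from above). Without the lazy quantifier and the lookaheads, [^[]+ simply grabs every run of characters found between opening brackets:

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# [^[]+ : one or more characters, none of which is an opening bracket
runs = re.findall(r'[^[]+', s1)
print(runs)
# ['Quantity ', "*,'EXTRA 05',*] ", "*,'EXTRA 09',*]"]
```

Note the trailing blanks and the missing opening brackets: the core alone is not enough, which is why the rest of the pattern is needed.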
Now, there is a question mark after [^[]+ : it makes the quantifier lazy (non-greedy), meaning that the match takes as few characters as possible and stops at the first position where what follows the question mark can match.
Here, what follows the ? is (?= *\[) which is a lookahead assertion. The (?=...) wrapper signals a positive lookahead, and its content (a blank, a star, then \[ ) is the sequence in front of which the match must stop: zero or more blanks followed by an opening bracket (the backslash \ is needed to strip [ of its special meaning as the opening of a set of characters). A lookahead only checks what comes next without consuming it, so neither the blanks nor the bracket end up in the match.
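The difference the question mark makes can be checked directly. This is only an illustrative sketch: the greedy variant is not part of the solution, it's there for comparison.

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# Lazy +? : stops as soon as "optional blanks, then [" lies ahead
lazy = re.match(r'[^[]+?(?= *\[)', s1).group()

# Greedy + : runs right up to the bracket, swallowing the blank
greedy = re.match(r'[^[]+(?= *\[)', s1).group()

print(repr(lazy))    # 'Quantity'
print(repr(greedy))  # 'Quantity '
```

Both versions stop before the bracket because of the lookahead, but only the lazy one leaves the blank out of the result.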
There's also (?! ) in front of the core. It's a negative lookahead assertion: it is necessary to make this sub-pattern match only sequences that do not begin with a blank, which prevents it from catching the blanks between the bracketed groups. Remove this (?! ) and you'll see the effect.
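For those who don't want to try it themselves, here is a quick comparison on the first sample string (the names with_guard and without_guard are just mine):

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# Full pattern, with the negative lookahead (?! )
with_guard = re.findall(r'(?! )[^[]+?(?= *\[)|\[.+?\]', s1)

# Same pattern with (?! ) removed
without_guard = re.findall(r'[^[]+?(?= *\[)|\[.+?\]', s1)

print(with_guard)
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
print(without_guard)
# ['Quantity', ' ', "[*,'EXTRA 05',*]", ' ', "[*,'EXTRA 09',*]"]
```

Without the guard, each blank that sits in front of a bracket is itself a valid match for the first alternative, so it shows up in the result.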
The second sub-pattern:
\[.+?\]
means: the opening bracket character [ , then a sequence of characters matched by .+? (the dot matches any character except \n ), kept as short as possible, then the closing bracket character ] , which is the last character of the match.
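The lazy quantifier matters here too. As a sketch, compare it with a greedy version (which is not what the solution uses):

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# Lazy .+? : each match ends at the first closing bracket
lazy_groups = re.findall(r'\[.+?\]', s1)

# Greedy .+ : a single match running to the LAST closing bracket
greedy_groups = re.findall(r'\[.+\]', s1)

print(lazy_groups)
# ["[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
print(greedy_groups)
# ["[*,'EXTRA 05',*] [*,'EXTRA 09',*]"]
```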
EDIT
import re
string = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
# Split on each blank that is immediately followed by an opening bracket
print(re.split(r' (?=\[)', string))
result
['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
!!
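Two remarks on the split version, as a sketch only. The lookahead is what keeps the opening bracket in each piece, and a single blank in the pattern splits on one blank only. If the real data can contain several blanks before a bracket (an assumption on my part, the question doesn't say), ' +' may be a safer choice. The string `spaced` below is made up for the demonstration.

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# Without the lookahead, the bracket is consumed by the separator
no_lookahead = re.split(r' \[', s1)
print(no_lookahead)
# ['Quantity', "*,'EXTRA 05',*]", "*,'EXTRA 09',*]"]

# Hypothetical input with several blanks before each bracket
spaced = "Quantity   [a]  [b]"

one_blank = re.split(r' (?=\[)', spaced)    # leftover blanks stay in the pieces
many_blanks = re.split(r' +(?=\[)', spaced)  # all the blanks are removed

print(one_blank)    # ['Quantity  ', '[a] ', '[b]']
print(many_blanks)  # ['Quantity', '[a]', '[b]']
```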