I'm trying to use re.split() to set up for some crude parsing of strings of this form: chord = "{<c,,4-^ f' a>8}"
(Input strings may or may not include whitespace before or after any of the bracket characters, so for example it could instead be: chord = "{ < c,,4-^ f' a> 8}". Also, brackets don't occur in every input string, so strings may start with 'c,' 'f,' 'a,' '3', etc.)
I want the following results from the above sample string:
"{","<","c,,4-^","f'","a",">","8","}"
That is, the string should be split on whitespace, which should be omitted from the result, and also on the various bracket characters, which should be retained in the result. So far all my efforts to compose a regex for re.split() have produced extraneous empty strings or None items. I see several questions on related issues with re.split, but everything I've read revolves around limiting the dot and star (.*) operator, e.g. "My regex is matching too much. How do I make it stop?". I'm using neither dot nor star.
After testing different combinations of or'd expressions I suspect two separate issues may be at work here:
(1) re.split puts empty strings in the result after the left curly brace, but not at the angle brackets or right brace:
re.split(r'(<|{)',chord) --> "","{","","<","c,, (...)
Flailing, I've tried adding a second { to the input string, prefixing the input with f, and escaping the { in the regex; all give the same results. (The initial empty string has appeared in the results with every split character I've tested when it occurs at the start of the string--is that expected?)
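For reference, here is a minimal reproduction of (1) using the sample string above. The second call uses a made-up input, "a<b", just to show that no empty strings appear when there is text on both sides of the split character:

```python
import re

chord = "{<c,,4-^ f' a>8}"

# The capturing group keeps the matched bracket in the result.
parts = re.split(r'(<|{)', chord)
print(parts)
# ['', '{', '', '<', "c,,4-^ f' a>8}"]

# Hypothetical input with text on both sides of the bracket: no empty strings.
print(re.split(r'(<|{)', "a<b"))
# ['a', '<', 'b']
```

So the empty strings seem to show up wherever nothing sits between the start of the string and a match, or between two adjacent matches.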
(2) All hell breaks loose when the whitespace finder (\s+) gets or'd in, outside of the parens. So with re.split(r'\s+|(<|{|})',chord), of the 21 items in the result list, 9 of them are either "" or None. I tried (?:\s+), no luck. (Is it possible to combine capture and non-capture groups?)
Since I'm processing a lot of these strings I'd prefer not to have to check for empty strings and Nones during parsing. Any suggestions, re.split-based or otherwise, for achieving the desired results as economically as possible?
(As things stand I'm planning to use str.split() on the input string, and then run re.split in a loop against each result item, knowing I'll need to do extra housekeeping to keep track of whether and how those result strings get further divided by re.split.)
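A rough sketch of that fallback, in case it helps frame the question. The function name tokenize and the character class [<>{}] are my own placeholders (the attempts above used alternation instead), and it still filters out empty strings, which is the housekeeping I'd like to avoid:

```python
import re

def tokenize(chord):
    """Split on whitespace first, then split each chunk on bracket characters."""
    tokens = []
    for chunk in chord.split():
        # The capturing group keeps the brackets; empty strings are dropped.
        tokens.extend(t for t in re.split(r'([<>{}])', chunk) if t)
    return tokens

print(tokenize("{<c,,4-^ f' a>8}"))
# ['{', '<', 'c,,4-^', "f'", 'a', '>', '8', '}']
print(tokenize("{ < c,,4-^ f' a> 8}"))
# ['{', '<', 'c,,4-^', "f'", 'a', '>', '8', '}']
```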