2

I'm trying to use re.split() to set up for some crude parsing of strings of this form:
chord = "{<c,,4-^ f' a>8}"
(Input strings may or may not include whitespace before or after any of the bracket characters, so for example it could instead be: chord = "{ < c,,4-^ f' a> 8}". Also, brackets don't occur in every input string, so strings may start with 'c,' 'f,' 'a,' '3', etc.)

I want the following results from the above sample string:
"{","<","c,,4-^","f'","a",">","8","}"

That is, the string should be split on whitespace, which should be ignored/omitted in the result, and also on the various bracket characters--but the brackets should be retained in the results. So far all my efforts to compose a regex string for re.split() have produced extraneous separate empty strings/None items. I see several questions on related issues with re.split, but everything I've read revolves around limiting the dot and star (.*) operator--e.g. "My regex is matching too much. How do I make it stop?". I'm using neither dot nor star.

After testing different combinations of or'd expressions I suspect two separate issues may be at work here:

(1) re.split puts empty strings in the result after the left curly brace, but not at the angle brackets or right brace: re.split(r'(<|{)',chord) --> "","{","","<","c,, (...) Flailing, I've tried adding a second { to the input string, prefixing the input with f, and escaping the { in the regex; all give the same results. (The initial empty string has appeared in the results with every split character I've tested when it occurs at the start of the string--is that expected?)

(2) All hell breaks loose when the whitespace finder gets or'd in (|), outside of the parens. So with re.split(r'\s+|(<|{|})',chord), of the 21 items in the result list, 9 of them are either "" or None. I tried (?:\s+), no luck. (Is it possible to combine capture and non-capture groups?)
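Both symptoms can be reproduced in a few lines. The following is a minimal sketch using the sample string from above; the variable names `first` and `second` are only for illustration. re.split returns the text between consecutive matches, so a delimiter at the very start of the string, or two delimiters in a row, leaves an empty string in between. And when a pattern contains a capturing group that takes no part in a particular match (here, whenever the \s+ branch is the one that matches), that group is reported as None.

```python
import re

chord = "{<c,,4-^ f' a>8}"

# (1) Only "<" and "{" are delimiters, and both are captured.
# The "" entries are the empty text before "{" and between "{" and "<".
first = re.split(r'(<|{)', chord)
print(first)
# ['', '{', '', '<', "c,,4-^ f' a>8}"]

# (2) Whitespace is now a delimiter too, but it sits outside the group.
# Each time the \s+ branch matches, the group did not participate -> None.
# (">" is not in this pattern, so "a>8" stays in one piece.)
second = re.split(r'\s+|(<|{|})', chord)
print(second)
# ['', '{', '', '<', 'c,,4-^', None, "f'", None, 'a>8', '}', '']
```

Wrapping the whitespace part as (?:\s+) changes nothing because \s+ was never captured in the first place; mixing capturing and non-capturing groups in one pattern is allowed.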

Since I'm processing a lot of these strings I'd prefer not to have to check for empty strings and Nones during parsing. Any suggestions, re.split-based or otherwise, for achieving the desired results as economically as possible?

(As things stand I'm planning to use str.split() on the input string, and then run re.split in a loop against each result item, knowing I'll need to do extra housekeeping to keep track of whether and how those result strings get further divided by re.split.)
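For reference, that two-stage plan might look roughly like the sketch below. The helper name `two_stage_tokens` is hypothetical, and the sketch assumes the four bracket characters { } < > are the only ones of interest. It still has to drop the empty strings that re.split leaves beside a bracket, but no further bookkeeping is needed because the pieces are appended in order.

```python
import re

def two_stage_tokens(text):
    """Hypothetical helper: split on whitespace first, then on brackets."""
    tokens = []
    for chunk in text.split():  # str.split() discards all whitespace
        # The capturing group keeps each bracket in the result; "if piece"
        # drops the empty strings re.split leaves around the brackets.
        tokens.extend(piece for piece in re.split(r'([{}<>])', chunk) if piece)
    return tokens

print(two_stage_tokens("{ < c,,4-^ f' a> 8}"))
# ['{', '<', 'c,,4-^', "f'", 'a', '>', '8', '}']
```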

Joan Eliot
  • Both of the answers I've received so far were helpful and came in with essentially the same solution. I chose the shorter one as my "answer" because the language was a little clearer and the regex simpler for my situation. The other answer shows how to handle line breaks and gives the link to that great regular expression testing site, so neck-and-neck. – Joan Eliot Sep 13 '19 at 03:16

2 Answers

1

Maybe, an expression similar to,

[{}]|[^\s><\r\n{}]+|[><]

might be OK to start off.

Here first we collect,

[{}]

then,

[^\s><\r\n{}]+

and finally,

[><]

You might want to reorder or adjust these character classes depending on which characters you wish to collect first, somewhat like a stack, and that would likely solve your problem.

Test

import re

print(re.findall(r"[{}]|[^\s><\r\n{}]+|[><]", "{ < c,,4-^ f' a> 8}"))

Output

['{', '<', 'c,,4-^', "f'", 'a', '>', '8', '}']

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.
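As one possible simplification (a sketch, not a replacement for the expression above): in Python, \s already matches \r and \n, so listing them again inside the negated class is redundant, and because the three classes can never match the same character, the two bracket classes can be merged without changing the result. The names `ORIGINAL`, `SIMPLIFIED`, and `samples` below are illustrative, and the last two sample strings are made up to cover input without brackets and input that spans a line break.

```python
import re

ORIGINAL = re.compile(r"[{}]|[^\s><\r\n{}]+|[><]")
SIMPLIFIED = re.compile(r"[{}<>]|[^\s{}<>]+")

samples = [
    "{ < c,,4-^ f' a> 8}",
    "{<c,,4-^ f' a>8}",
    "c,,4 f' a",              # no brackets at all
    "{<c e g>4\n<d f a>8}",   # spans a line break
]

for s in samples:
    # Both expressions should tokenize every sample identically.
    assert ORIGINAL.findall(s) == SIMPLIFIED.findall(s)
    print(SIMPLIFIED.findall(s))
```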


Emma
  • Your answer works for all the test strings I've run it against so far. What effect do the two `=` characters in the regex produce? Both occurrences follow right angle brackets--does that combination have special meaning? Or do the `=` characters make literal matches? The expression gives the same result on my sample string when I take out `=`. Also, please see my question about re.findall vs re.split in my comment on dcg's answer. Appreciate very much the pointer to the regex101 site. – Joan Eliot Sep 05 '19 at 05:43
1

Assuming the symbols {}<> are the ones you want to split out, you can match any token that contains none of those characters (and no whitespace) with something like [^{<>}\s]+, and you can match any single one of the bracket characters with [{}<>].

Then the whole regular expression would be [^{<>}\s]+|[{}<>]. For your example:

>>> import re
>>> chord = "{<c,,4-^  f' a>8}"
>>> re.findall(r'[^{<>}\s]+|[{}<>]', chord)
['{', '<', 'c,,4-^', "f'", 'a', '>', '8', '}']
>>> 
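Since the goal is crude parsing, it may also help to know which alternative produced each token. The sketch below is one way to do that, assuming the same two character classes as above; the group names `item` and `bracket` and the helper `tagged_tokens` are made up for illustration. It uses re.finditer and the match object's lastgroup attribute, and precompiles the pattern since many strings will be processed.

```python
import re

# Same two alternatives as above, wrapped in named groups.
TOKEN = re.compile(r'(?P<item>[^{<>}\s]+)|(?P<bracket>[{}<>])')

def tagged_tokens(text):
    """Hypothetical helper: yield (kind, token) pairs in input order."""
    for m in TOKEN.finditer(text):
        # lastgroup is the name of the group that matched: 'item' or 'bracket'.
        yield m.lastgroup, m.group()

for kind, tok in tagged_tokens("{<c,,4-^  f' a>8}"):
    print(kind, tok)
```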

Hope it helps.

dcg
  • Compact, as readable as a regex can be, and so far it has given the expected results on all of my test strings. It's clear that re.split wasn't the right tool for this job. Any tips on when to use re.split vs re.findall to divide strings at character boundaries? – Joan Eliot Sep 05 '19 at 05:52
  • The thing with `re.split` is that it splits your string at the matches of the regular expression you pass in, so you would have to pass the complement of what you're expecting. In this case I don't see a clear solution (nor any at all) using it. – dcg Sep 05 '19 at 14:40
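On the re.split vs re.findall question, a rough rule of thumb (not an official guideline) is that re.split fits when the separators are easy to describe and can be thrown away, while re.findall fits when the tokens themselves are easy to describe. A re.split version of this task does work if the result is filtered afterwards, as the sketch below shows; the variable names are illustrative.

```python
import re

chord = "{ < c,,4-^ f' a> 8}"

# re.split: describe the separators. Whitespace is discarded, brackets are
# captured so they stay in the result; "" and None must then be filtered out.
via_split = [t for t in re.split(r'\s+|([{}<>])', chord) if t]

# re.findall: describe the tokens themselves; nothing needs filtering.
via_findall = re.findall(r'[^{<>}\s]+|[{}<>]', chord)

print(via_split)
print(via_split == via_findall)
```

Both lists come out the same for this input, so the choice is mostly about which pattern is easier to read and whether the extra filtering pass is acceptable.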