
I'll try to make this short and easy since I am having a hard time trying to put into words what exactly it is I am trying to do.

Basically I am trying to match tokens inside tokens or the entire token. The regex I have below works except for when there is a random { that isn't part of a token.

Example:

Tokens start with "{:" and end with ":}"
{:MyTokenFunction({:MyTokenParameter:}):} 
^--- WORKS as it matches "{:MyTokenParameter:}"

{:MyTokenFunction(5):} 
^--- WORKS as it matches "{:MyTokenFunction(5):}"

{:MyTokenFunction(random{string}):} 
^--- The "{" causes no matches, but should match the entire string.

In a colored example of what the regex matches, the first 2 examples are correct, but the 3rd example should match entirely and it doesn't match at all.

Here's the regex I am currently using which is having issues with the third example:

\{\:[^\{]+?\:\}

For the life of me I cannot figure out how to get around the { causing 0 matches.

I tried to use lookbehinds/aheads, but I wasn't having much luck. Although I would of course love a quick answer, I would value an explanation of what the regex is actually doing even more. I have done a lot of searching to try and figure this out, but was unable to find a good example because my "tokens" are wrapped by multiple characters and the start/end aren't the same.

Thanks

TyCobb

2 Answers


Your regex matches strings that begin with {:, end with :}, and do not have any { in between. Note the ? that appears after the +, which makes it non-greedy. There is a lot of documentation available on it.

In your case,

{:MyTokenFunction(random{string}):} 

Has that {string} part in the middle, which it rejects because of the curly braces.

If that is a valid string, then that condition shouldn't be there and should be taken out. For example

\{\:.+?\:\}

However, if that condition is a restriction, then your string is not valid input.
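To make the difference concrete, here is a minimal sketch that runs both patterns over the three examples from the question. It uses Python's `re` module as a stand-in (an assumption; the question is about .NET, but these two patterns use no engine-specific syntax), and the names `ORIGINAL`, `LAZY_ANY`, and `compare` are made up for illustration.

```python
import re

ORIGINAL = re.compile(r"\{\:[^\{]+?\:\}")  # pattern from the question
LAZY_ANY = re.compile(r"\{\:.+?\:\}")      # pattern with the "no {" condition removed

EXAMPLES = [
    "{:MyTokenFunction({:MyTokenParameter:}):}",
    "{:MyTokenFunction(5):}",
    "{:MyTokenFunction(random{string}):}",
]

def compare(text):
    """Return (matches of the original pattern, matches of the relaxed pattern)."""
    return ORIGINAL.findall(text), LAZY_ANY.findall(text)

for text in EXAMPLES:
    original_matches, lazy_matches = compare(text)
    print(text)
    print("  original:", original_matches)
    print("  relaxed: ", lazy_matches)
```

The relaxed pattern does match the third example, but on the first example it starts at the outer `{:` and stops at the first `:}`, so it no longer isolates the inner token.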

MxLDevs
  • On the other hand, the requirement may simply be "you must have matching brackets", so `{` is only valid if there's an accompanying `}`, in which case you have a different issue to address. If it is a parenthesis-matching problem, where you may have an arbitrary number of `{`'s and `}`'s, then you can't really use regex for that. – MxLDevs May 28 '14 at 23:42
  • The only things that need to match are the `{:` and `:}`. This is an internal tool that is getting updated and I was tasked with polishing it up. Unfortunately, there are strings that can contain just `{` which are valid and will just be part of the standard token. Your regex works except it doesn't capture only the inner token if present. `[ ]` denote match -- `[{:MyTokenFunction({:MyTokenParameter:}]):}` – TyCobb May 28 '14 at 23:45
  • Is that what you expect it to match, or what it actually matches? – MxLDevs May 28 '14 at 23:59
  • @TyCobb Oh, take out the `?`. That tells it not to match greedily and will stop as soon as it can. Not sure if that helps, but if you only want the inner token (and not the outer one), you likely will have to find another solution. – MxLDevs May 29 '14 at 00:01
  • I took it out, but now it matches too much. It should only find the inner most tokens first. This is called recursively so the expression itself doesn't need to do everything. Just match inner if it is inside or entire token if there isn't an inner token. I have updated the question with an example of what it currently does and what I am looking for. Just need to figure out how to match the 3rd example. EDIT: Just saw the change to your comment. =/ – TyCobb May 29 '14 at 00:13
  • @TyCobb It looks like .NET provides some [regex functionality](http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#balancing_group_definition) that I am not familiar with. In particular, the "balancing group definition", which appears to be used for nested matching. Someone with more experience with .NET's regex might have a solution. More info [here](http://www.regular-expressions.info/balancing.html). I don't really understand it after reading it through once though, but looking at the examples it looks like something worth understanding. – MxLDevs May 29 '14 at 00:36
  • Of interest http://stackoverflow.com/questions/9813751/get-inner-patterns-recursively-using-regex-c-sharp and http://stackoverflow.com/questions/17003799/what-are-regular-expression-balancing-groups. Trying to figure out how this all works... – MxLDevs May 29 '14 at 01:13
  • Thanks a lot for the links! I will definitely play around with these tomorrow once back at work. If you want to update your answer with this information I will +1 it. =) This looks very promising. – TyCobb May 29 '14 at 03:14
  • @TyCobb I still wouldn't recommend using regex for this, but instead tokenizing it properly based on its grammar. Unless you can specify every possible case that you have to deal with and therefore simply need to write an expression that handles every one of those cases, it is much more flexible to just read it the way it is defined. I don't know if a case such as `{:fn1({:token2:}, {:fn3({:token4:}):}:}` is possible but you'd have to craft a pretty tricky expression. – MxLDevs May 29 '14 at 04:04
  • An older version is already in production that goes character by character and the expressions get pretty complicated with many inner tokens. This newer version already works and the instance with `{` was caught purely due to a really old and obsolete file I found while unit testing. I just wanted to get the newer version to handle it and using a combination of regex actually speeds up the parsing of several hundred tokens quite a bit. – TyCobb May 29 '14 at 05:09

This is a lovely question because it requires us to balance opening and closing tokens, a task for which .NET happens to have a ready-made feature: balancing groups.

Let's look at this in separate pieces.

Why doesn't your regex work?

[^\{]+ means "match one or more characters that are not a {"

Clearly, that is not going to be able to match the { in {string}
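If the goal is the one stated in the question (match an inner token when there is one, otherwise the whole token), one possible sketch is to forbid the two-character delimiters `{:` and `:}` inside the token rather than the single character `{`. This is an assumption about the intent, not something the original pattern did. It is shown here with Python's `re` module as a stand-in for .NET, and `INNERMOST` and `innermost_tokens` are hypothetical names.

```python
import re

# Assumption: an "innermost token" is a {: ... :} span that contains
# neither {: nor :} between its delimiters. A lone { or } is allowed.
INNERMOST = re.compile(r"\{:(?:(?!\{:|:\}).)*:\}")

def innermost_tokens(text):
    """Return every {: ... :} span that has no nested delimiter inside it."""
    return INNERMOST.findall(text)

for example in (
    "{:MyTokenFunction({:MyTokenParameter:}):}",
    "{:MyTokenFunction(5):}",
    "{:MyTokenFunction(random{string}):}",
):
    print(example, "->", innermost_tokens(example))
```

The `(?:(?!\{:|:\}).)*` part consumes one character at a time, but only while the current position does not start a `{:` or a `:}`. Since the question says the matcher is called recursively, replacing each innermost match and running again would work outward one level at a time.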

Simple solution (with caveats)

{:.*:}

This will greedily match everything from the first {: to the last :} on the line. This works if you have only one token per line (and if you are not in DOTALL mode).

However, if you have two tokens on the same line, the regex will eat them both, along with everything between them. And if you are in DOTALL mode, a single match will run from the first {: in the input to the last :}, eating all the tokens.
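The caveats are easy to see in a small sketch. This one uses Python's `re` module as a stand-in for .NET (an assumption; `re.DOTALL` plays the role of DOTALL/Singleline mode), with the braces escaped for safety. The sample strings other than the question's third example are made up.

```python
import re

GREEDY = re.compile(r"\{:.*:\}")
GREEDY_DOTALL = re.compile(r"\{:.*:\}", re.DOTALL)

one_token = "{:MyTokenFunction(random{string}):}"
two_on_a_line = "{:A:} and {:B:}"
two_lines = "{:A:}\n{:B:}"

print(GREEDY.findall(one_token))         # the single token, as wanted
print(GREEDY.findall(two_on_a_line))     # one match that swallows both tokens
print(GREEDY.findall(two_lines))         # one match per line, since . stops at \n
print(GREEDY_DOTALL.findall(two_lines))  # one match spanning both lines
```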


More complex (but far stronger) solution

To avoid the problem above, we need to balance the braces. In Perl or PCRE, we would use recursion. Since we're in .NET, we'll use balancing groups, which are a beautiful feature of the .NET engine.

Here is one way to do it. That's a mouthful, but I'll explain it below.

(?:{:(?<counter>)(?:(?!{:|:}).)*)+(?::}(?<-counter>)(?:(?!{:|:}).)*)+(?<=:})(?(counter)(?!))


How does this work?

Here is the same regex, but in free-spacing mode, with comments. I would suggest using this version in code, as it makes it easier to maintain.

(?x) # free-spacing mode
(?:{:(?<counter>)(?:(?!{:|:}).)*)+ # match all the opening {: and increment counter
(?::}(?<-counter>)(?:(?!{:|:}).)*)+ # match all the closing :} and decrement counter
(?<=:}) # positive lookbehind: we must close with a :} (backtrack if we went too far)
(?(counter)(?!)) # if the counter has not been decremented to zero, then fail (ensuring balance)
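The counting idea can also be written out by hand. The sketch below is not a translation of the regex above: Python's `re` module has no balancing groups, so it swaps in a plain depth counter that scans for `{:` and `:}`. It also accepts sibling tokens nested inside one outer token, which may be more permissive than the regex. The function name `balanced_tokens` is hypothetical.

```python
def balanced_tokens(text):
    """Return each outermost {: ... :} span whose delimiters balance."""
    tokens = []
    depth = 0      # plays the role of the "counter" group
    start = None   # index of the {: that opened the current outermost token
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{:":
            if depth == 0:
                start = i
            depth += 1             # like pushing onto counter
            i += 2
        elif pair == ":}" and depth > 0:
            depth -= 1             # like popping from counter
            i += 2
            if depth == 0:         # balanced: emit the whole span
                tokens.append(text[start:i])
        else:
            i += 1                 # ordinary character, including a lone { or }
    return tokens

print(balanced_tokens("{:MyTokenFunction({:MyTokenParameter:}):}"))
print(balanced_tokens("{:MyTokenFunction(random{string}):}"))
print(balanced_tokens("{:A:} and {:B:}"))
```

A token that is opened but never closed leaves `depth` above zero at the end of the input and is simply not reported, which mirrors the final "fail unless the counter is empty" check.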

Potential Tweaks

Depending on your needs, there are potential tweaks: for instance, if you want tokens to be able to span several lines. Just let us know.

zx81
  • This worked! Thanks a lot for the explanation too which is most valuable. Although this didn't grab inner tokens first, a few tweaks to the code allowed this newer regex to work. – TyCobb May 29 '14 at 17:35
  • @TyCobb Glad this worked for you, it was a pleasure. :) thanks for letting me know. – zx81 May 29 '14 at 19:41