Need regexp to find substring between two tokens

Question

I suspect this has already been answered somewhere, but I can't find it, so...

I need to extract a string from between two tokens in a larger string, where the second token will probably appear again later in the string. In pseudocode:

myString = "A=abc;B=def_3%^123+-;C=123;";

myB = getInnerString(myString, "B=", ";");

method getInnerString(inStr, startToken, endToken) {
   return inStr.replace(EXPRESSION, "$1");
}

So, when I run this using the expression ".+B=(.+);.+" I get "def_3%^123+-;C=123;", presumably because it looks for the LAST instance of ';' in the string rather than stopping at the first one it comes to.

I've tried using (?=) in search of that first ';' but it gives me the same result.

I can't seem to find a regexp reference that explains how to specify the NEXT occurrence of a token rather than the last one in the string.

Any and all help greatly appreciated.

Similar question on SO:

dmckee, your edit seems pointless. Yes, other people have seen similar questions, but that doesn't necessarily mean that they help. — Evan Fosmark, Jan 29 '09 at 04:58
@Evan: If SO is to be a repository of good answers, then multiple instances of single questions are disruptive unless they are interlinked. So I link. Mostly back, but sometimes forward too. I'll edit the poor grace. No excuses for that. --Cheers — dmckee --- ex-moderator kitten, Jan 29 '09 at 15:37
@dmckee - isn't this what search mechanism supposed to do? Did you just cut/pasted search results of "regexp inside" or some such? Please don't do that - the links are as long as the question itself and quite distracting. — , Jan 29 '09 at 15:45
@Arkadiy: I'm willing to be talked out of this, but neither the search nor the "related" sidebar work well. Indeed, I see the steady accumulation of repeats as evidence of how badly the search works. — dmckee --- ex-moderator kitten, Jan 29 '09 at 19:15
@Arkadiy: I get my lists from variations I have already answered, or have favorite-ed, or can remember enough of the title to find with a search (and from the favorite bar of those). — dmckee --- ex-moderator kitten, Jan 29 '09 at 19:16
Arkadiy, Evan's answer was significantly more succinct than the others (that I failed to find on my own). However, if someone had given me those links I would have deduced what I needed from them as well, so, that would have been equally helpful. Not sure what the problem is though. — Yevgeny Simkin, Jan 30 '09 at 03:53

score 7 · Accepted Answer · answered Jan 28 '09 at 21:58

Your pattern is greedy because the quantifier inside the group has no ? after it. Try this:

".+B=(.+?);.+"
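
To see the difference in isolation, here is a minimal sketch in Java that runs a greedy and a lazy version of the group against the string from the question. The class name GreedyVsLazy and the helper firstGroup are made up for illustration, and the patterns drop the leading and trailing .+ so that only the quantifier inside the group changes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns capture group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";

        // Greedy: (.+) takes as much as it can and gives back only what it must,
        // so the ';' after the group ends up being the last one in the string.
        System.out.println(firstGroup("B=(.+);", s));   // def_3%^123+-;C=123

        // Lazy: (.+?) takes as little as it can, so it stops at the first ';'.
        System.out.println(firstGroup("B=(.+?);", s));  // def_3%^123+-
    }
}
```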

Evan Fosmark

Thanks! works like a charm, though having read description for '?' I'm not sure I see why it would produce said effect. – Yevgeny Simkin Jan 28 '09 at 22:07
Dr.Dredel, it makes it match as few # of characters as possible. Without it, it matches as many as possible (making it greedy because it takes so much). – Evan Fosmark Jan 28 '09 at 22:09
But it takes a lot of backtracking. – Gumbo Jan 28 '09 at 22:20
No, non-greedy quantifiers *eliminate* backtracking by doing some extra work up front. – Alan Moore Jan 29 '09 at 03:13
And what regex implementation does this? I ran a test in RegexBuddy with all regex flavors and everyone had to backtrack and needed 82 steps to find a match. – Gumbo Jan 29 '09 at 09:26
It's the .+ at the beginning of the regex that's causing all the backtracking. But that, and the one at the end, only need to be there because the OP is doing a 'replace' when he should be doing a 'find'. – Alan Moore Jan 29 '09 at 09:45
Alan M, what do you mean I *should be doing a find? I need the rest of stuff in the string (the start and end around the substring I'm in need of) to go away... I'm not clear on what you're saying. – Yevgeny Simkin Jan 30 '09 at 03:47
No, you shouldn't need to match those parts of the string. I'll have to post a separate answer to explain (I'll hit the backtracking thing, too). – Alan Moore Jan 30 '09 at 06:27

score 5 · Answer 2 · answered Jan 28 '09 at 21:58

Try this:

B=([^;]+);

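The character class [^;] matches any single character except ;, so the group captures everything between B= and the first ; that follows it. No backtracking past that ; is possible, because the class can never consume one.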
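
A short sketch of this pattern in Java, assuming the same sample string as the question; the class name NegatedClassDemo and the method valueOfB are hypothetical. Note that [^;]+ needs at least one character, so an empty value such as "B=;" does not match at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Captures the run of non-';' characters that follows "B=", or returns null if there is none.
    static String valueOfB(String input) {
        Matcher m = Pattern.compile("B=([^;]+);").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOfB("A=abc;B=def_3%^123+-;C=123;")); // def_3%^123+-
        System.out.println(valueOfB("A=abc;B=;C=123;"));             // null, because + requires one character
    }
}
```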

Gumbo

score 2 · Answer 3 · answered Jan 30 '09 at 06:36

(This is a continuation of the conversation from the comments to Evan's answer.)

Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
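
One way to check that walk-through is to wrap the leading and trailing .+ in capturing groups of their own and print what each part ended up with. This is only an illustrative sketch; the class name WalkThrough and the method parts are invented here, and the extra parentheses are not part of the regex under discussion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WalkThrough {
    // Splits the input the way the regex does:
    // [0] what the leading .+ kept, [1] the captured value, [2] what the trailing .+ consumed.
    static String[] parts(String input) {
        Matcher m = Pattern.compile("(.+)B=(.+?);(.+)").matcher(input);
        if (!m.matches()) {
            return null; // the regex has to cover the whole string
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("A=abc;B=def_3%^123+-;C=123;");
        System.out.println("leading .+  kept: " + p[0]); // A=abc;
        System.out.println("(.+?) captured : " + p[1]);  // def_3%^123+-
        System.out.println("trailing .+ took: " + p[2]); // C=123;
    }
}
```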

All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access the contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):

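import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "A=abc;B=def_3%^123+-;C=123;";

// Match only the part we care about; group 1 is the text between "B=" and the next ";".
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
  System.out.println(m.group(1));  // def_3%^123+-
}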

Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:

print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;

...so we content ourselves with hacks like this:

System.out.println("A=abc;B=def_3%^123+-;C=123;"
    .replaceFirst(".+B=(.*?);.+", "$1"));

Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
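
For completeness, the getInnerString method from the question could be written with find() along these lines. This is a sketch, not the only way to do it: the class name InnerString is made up, returning null when nothing matches is an arbitrary choice, and wrapping the tokens in Pattern.quote() is an added assumption so that tokens containing regex metacharacters are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnerString {
    // Returns the text between the first startToken and the next endToken after it,
    // or null if the pair is not found.
    static String getInnerString(String inStr, String startToken, String endToken) {
        // Pattern.quote() makes the tokens literal, so "(" or "." are safe to pass in.
        Pattern p = Pattern.compile(
                Pattern.quote(startToken) + "(.*?)" + Pattern.quote(endToken));
        Matcher m = p.matcher(inStr);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String myString = "A=abc;B=def_3%^123+-;C=123;";
        System.out.println(getInnerString(myString, "B=", ";")); // def_3%^123+-
        System.out.println(getInnerString("x(1+2)y", "(", ")")); // 1+2
    }
}
```

Since . does not match line terminators by default, a value that spans several lines would need Pattern.DOTALL or the inline (?s) flag.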

I am using java and am infuriated by how sad its regex options are compared to perl. 2 other infuriating weaknesses are the absence of the last / option (where you stick the i,g etc.) and instead having to run some bizarro .IGNORE_CASE constant, or the obscenely — Yevgeny Simkin, Jan 30 '09 at 15:44
ugly necessity to escape all your '\' with additional \, making the (already difficult to examine with human eyes) regex MUCH harder to look at. Not to mention that if you need to run a string through multiple regex, there's a good chance that the resulting string will lose one level of '\'. — Yevgeny Simkin, Jan 30 '09 at 15:45
I admit that I'm brand new to using Regex in Java, but I note in your comment the reference to the lack of elegance of its use in perl (with which I am familiar) and am inclined to completely agree. lastly, — Yevgeny Simkin, Jan 30 '09 at 15:46
from code design standpoint, the original example offered by Evan is a lot prettier, albeit more wasteful cycle wise. — Yevgeny Simkin, Jan 30 '09 at 15:47
The transition from Perl to Java is bound to be painful anyway, Java being so much more rigid and verbose. Just try to accept it on its own terms. As for the modifiers, I almost never use IGNORE_CASE and such; just stick (?i) at the beginning of the regex itself. — Alan Moore, Jan 31 '09 at 00:14
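
As a small illustration of that last point, the two methods below should behave the same: one passes the flag constant (in java.util.regex it is Pattern.CASE_INSENSITIVE) and the other embeds (?i) in the regex. The class and method names are hypothetical, and the lowercase "b=" is used only so that the flag has a visible effect on the sample string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineFlagDemo {
    // Case-insensitive matching requested through the flag argument of Pattern.compile().
    static String viaConstant(String input) {
        Matcher m = Pattern.compile("b=(.*?);", Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // The same request written inside the regex itself, Perl style.
    static String viaInlineFlag(String input) {
        Matcher m = Pattern.compile("(?i)b=(.*?);").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";
        System.out.println(viaConstant(s));   // def_3%^123+-
        System.out.println(viaInlineFlag(s)); // def_3%^123+-
    }
}
```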
