I'm writing some simple (I thought) regex in Java to remove an asterisk or ampersand which occurs directly next to some specified punctuation.
This was my original code:

String ptr = "\\s*[\\*&]+\\s*";
String punct1 = "[,;=\\{}\\[\\]\\)]"; //need two because bracket rules different for ptr to left or right
String punct2 = "[,;=\\{}\\[\\]\\(]";

out = out.replaceAll(ptr+"("+punct1+")|("+punct2+")"+ptr,"$1");

Instead of removing just the "ptr" part of the string, this removed the punctuation too (i.e. it replaced the whole match with an empty string).
I examined further by doing:

String ptrStr = ".*"+ptr+"("+punct1+")"+".*|.*("+punct2+")"+ptr+".*";
Matcher m_ptrStr = Pattern.compile(ptrStr).matcher(out);

and found that:

m_ptrStr.matches() //returns true, but...
m_ptrStr.group(1) //returns null??

I have no idea what I'm doing wrong as I've used this exact method before with far more complicated regex and group(1) has always returned the captured group. There must be something I haven't been able to spot, so.. any ideas?

Magg G.
  • No need to quote `{`, `*` or the parens in character classes – fge Mar 19 '14 at 15:40
  • Anyway -- I suspect .group(1) is null here because it is your second group which has a match – fge Mar 19 '14 at 15:41
  • oh! I thought group(1) was the first matched group, no matter where in the regex string it was. That explains a lot, thanks! – Magg G. Mar 19 '14 at 15:48

2 Answers


The problem is that you have an alternation with a capturing group on each side:

(regex1)|(regex2)

At each position the matcher first tries the left alternative; only if that fails does it try the right one.

However, those are still two separate groups, and only one of them takes part in any given match. The group that does not take part returns null, which is what happens here.

You therefore need to test both groups; since you have a match, at least one will not be null.
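
A minimal sketch of testing both groups, reusing the `ptr`/`punct1`/`punct2` pieces from the question. The class name `BothGroups` and the helper `stripPtr` are made up for illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BothGroups {
    // Same building blocks as in the question.
    static final String PTR = "\\s*[\\*&]+\\s*";
    static final String PUNCT1 = "[,;=\\{}\\[\\]\\)]";
    static final String PUNCT2 = "[,;=\\{}\\[\\]\\(]";
    static final Pattern P =
        Pattern.compile(PTR + "(" + PUNCT1 + ")|(" + PUNCT2 + ")" + PTR);

    // Hypothetical helper: replaces each match with whichever
    // punctuation group actually took part in it.
    static String stripPtr(String in) {
        Matcher m = P.matcher(in);
        StringBuffer sb = new StringBuffer(); // StringBuffer overload works on all Java versions
        while (m.find()) {
            // Exactly one of the two groups is non-null for any match.
            String kept = m.group(1) != null ? m.group(1) : m.group(2);
            m.appendReplacement(sb, Matcher.quoteReplacement(kept));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripPtr("foo(a *, b)")); // prints foo(a, b)
        System.out.println(stripPtr("x = &y;"));     // prints x =y;
    }
}
```

Note that the whitespace swallowed by `ptr` is dropped along with the asterisk or ampersand, as the second output line shows.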

fge
  • Yes, sorry... I am used to regex languages where this is not a cause for concern :/ – fge Mar 19 '14 at 15:49

When you have | in your pattern, that means that the matcher is allowed to match one of two patterns. Whichever one it matches, any capture groups for the pattern it matches will return the substrings--but any capture groups for the other pattern will return null, because the other pattern wasn't really matched.

It looks like your pattern is

.*\s*[\*&]+\s*([,;=\{}\[\]\)]).*|.*([,;=\{}\[\]\(])\s*[\*&]+\s*.*
------------- left -------------- ------------- right -------------

If matches() returns true, then either your string matched the "left" pattern, in which case group(1) will be non-null and group(2) will be null; or else it matched the "right" pattern, in which case group(1) will be null and group(2) non-null. [Note: The matcher will not try to find out if both sides are successful matches. That is, if the left side matches, it won't check the right side.]
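
To see this in action, here is a small sketch that builds the same pattern from the question's pieces and reports which side matched. The class name `WhichSide`, the helper `side`, and the sample inputs are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhichSide {
    // Same building blocks as in the question.
    static final String PTR = "\\s*[\\*&]+\\s*";
    static final String PUNCT1 = "[,;=\\{}\\[\\]\\)]";
    static final String PUNCT2 = "[,;=\\{}\\[\\]\\(]";
    static final Pattern FULL = Pattern.compile(
        ".*" + PTR + "(" + PUNCT1 + ").*|.*(" + PUNCT2 + ")" + PTR + ".*");

    // Hypothetical helper: says which alternative matched and what it captured.
    static String side(String in) {
        Matcher m = FULL.matcher(in);
        if (!m.matches()) {
            return "none";
        }
        // group(1) belongs to the left alternative, group(2) to the right one.
        return m.group(1) != null ? "left:" + m.group(1) : "right:" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(side("int *)")); // prints left:)
        System.out.println(side("f(*p"));   // prints right:(
        System.out.println(side("abc"));    // prints none
    }
}
```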

ajb
  • "The matcher will not check to see if it matches both sides" <-- not quite; it will try the second alternation if the first fails. POSIX regex engines (which Java isn't) will always check both alternations, and so will DFA engines – fge Mar 19 '14 at 15:45
  • @fge I wasn't referring to the case where the first alternative fails, but I've tried to clarify the wording. Interesting (and surprising) tidbit about other regex engines--thanks. – ajb Mar 19 '14 at 15:52