In my code I rely on Matcher.find() from java.util.regex. While matching, I want to skip over some text that may or may not occur between the parts I care about. In the example below I do that by adding the optional group
(?:\n.+)?
Although the group is optional, when the fluff text is absent from the input, find() no longer matches the whole input string, whereas matches() still does. Below are two examples: one with the fluff text, where both methods agree, and the problematic one without the fluff text, where find() and matches() behave differently.
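To make the behavior easier to reproduce, here is a minimal sketch with a much smaller pattern. The pattern, the class name FindVsMatchesSketch, and the helpers foundText, matchedText and quote are all hypothetical stand-ins, not part of my real code; the sketch only assumes that the reduced pattern has the same shape (fixed head, optional fluff line, optional SL/TGT lines).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatchesSketch {
    // Hypothetical reduced pattern: fixed head, optional fluff line,
    // then an optional "SL <n>" / "TGT <n>" pair of lines.
    static final Pattern REDUCED =
            Pattern.compile("HEAD(?:\\n.+)?(?:\\nSL (\\d+)\\nTGT (\\d+))?");

    // Text matched by find(), or null when nothing is found.
    static String foundText(String input) {
        Matcher m = REDUCED.matcher(input);
        return m.find() ? m.group(0) : null;
    }

    // Text matched by matches(), or null when the whole input does not match.
    static String matchedText(String input) {
        Matcher m = REDUCED.matcher(input);
        return m.matches() ? m.group(0) : null;
    }

    // Makes newlines visible so each result prints on one line.
    static String quote(String s) {
        return s == null ? "null" : "\"" + s.replace("\n", "\\n") + "\"";
    }

    public static void main(String[] args) {
        String withFluff = "HEAD\nFluff text\nSL 5\nTGT 10";
        String withoutFluff = "HEAD\nSL 5\nTGT 10";
        for (String input : new String[] {withFluff, withoutFluff}) {
            System.out.println("input:     " + quote(input));
            System.out.println("find():    " + quote(foundText(input)));
            System.out.println("matches(): " + quote(matchedText(input)));
        }
    }
}
```

With the fluff line present both calls cover the whole input; without it, find() in this sketch stops after "HEAD\nSL 5" while matches() still covers everything, which mirrors the behavior described above.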
With the fluff text in the input, find() and matches() give identical results:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Deleteme {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile(
                "BUY ([A-Z ]+) (\\d*\\.\\d+|\\d+) (CE|PE) (AT|ABOVE) (\\d*\\.\\d+|\\d+)(?:-(\\d*\\.\\d+|\\d+))?(?:\\n.+)?(?:[\\n|\\s]+SL (\\d*\\.\\d+|\\d+)[\\n|\\s]+TGT ([\\d\\.,+]+))?(?:(?:[\\n|\\s]+.+)?[\\n|\\s]+(January|February|March|April|May|June|July|August|September|October|November|December) EXPIRY)?",
                Pattern.CASE_INSENSITIVE);
        // Input that contains a fluff line between the BUY line and the SL line.
        String msg = "BUY HAL 3450 CE ABOVE 70\n" +
                "Fluff text\n" +
                "SL 5\n" +
                "TGT 10,11,13,15++++";
        System.out.println("Result from find():");
        Matcher matcher = pattern.matcher(msg);
        if (matcher.find()) {
            printGroups(matcher);
        }
        System.out.println("\nResult from matches():");
        matcher = pattern.matcher(msg);
        if (matcher.matches()) {
            printGroups(matcher);
        }
    }

    private static void printGroups(Matcher matcher) {
        System.out.println("group(0)=" + matcher.group(0));
        // Note: '<' stops one short of groupCount(), so the last group (9, the month) is not printed.
        for (int i = 1; i < matcher.groupCount(); i++) {
            System.out.println("group(" + i + " )=" + matcher.group(i));
        }
    }
}
This is the output:
Result from find():
group(0)=BUY HAL 3450 CE ABOVE 70
Fluff text
SL 5
TGT 10,11,13,15++++
group(1 )=HAL
group(2 )=3450
group(3 )=CE
group(4 )=ABOVE
group(5 )=70
group(6 )=null
group(7 )=5
group(8 )=10,11,13,15++++
Result from matches():
group(0)=BUY HAL 3450 CE ABOVE 70
Fluff text
SL 5
TGT 10,11,13,15++++
group(1 )=HAL
group(2 )=3450
group(3 )=CE
group(4 )=ABOVE
group(5 )=70
group(6 )=null
group(7 )=5
group(8 )=10,11,13,15++++
Without the fluff text in the input, find() and matches() give different results:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Deleteme {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile(
                "BUY ([A-Z ]+) (\\d*\\.\\d+|\\d+) (CE|PE) (AT|ABOVE) (\\d*\\.\\d+|\\d+)(?:-(\\d*\\.\\d+|\\d+))?(?:\\n.+)?(?:[\\n|\\s]+SL (\\d*\\.\\d+|\\d+)[\\n|\\s]+TGT ([\\d\\.,+]+))?(?:(?:[\\n|\\s]+.+)?[\\n|\\s]+(January|February|March|April|May|June|July|August|September|October|November|December) EXPIRY)?",
                Pattern.CASE_INSENSITIVE);
        // Same input as before, but without the fluff line.
        String msg = "BUY HAL 3450 CE ABOVE 70\n" +
                "SL 5\n" +
                "TGT 10,11,13,15++++";
        System.out.println("Result from find():");
        Matcher matcher = pattern.matcher(msg);
        if (matcher.find()) {
            printGroups(matcher);
        }
        System.out.println("\nResult from matches():");
        matcher = pattern.matcher(msg);
        if (matcher.matches()) {
            printGroups(matcher);
        }
    }

    private static void printGroups(Matcher matcher) {
        System.out.println("group(0)=" + matcher.group(0));
        // Note: '<' stops one short of groupCount(), so the last group (9, the month) is not printed.
        for (int i = 1; i < matcher.groupCount(); i++) {
            System.out.println("group(" + i + " )=" + matcher.group(i));
        }
    }
}
This is the output:
Result from find():
group(0)=BUY HAL 3450 CE ABOVE 70
SL 5
group(1 )=HAL
group(2 )=3450
group(3 )=CE
group(4 )=ABOVE
group(5 )=70
group(6 )=null
group(7 )=null
group(8 )=null
Result from matches():
group(0)=BUY HAL 3450 CE ABOVE 70
SL 5
TGT 10,11,13,15++++
group(1 )=HAL
group(2 )=3450
group(3 )=CE
group(4 )=ABOVE
group(5 )=70
group(6 )=null
group(7 )=5
group(8 )=10,11,13,15++++
In the find() output above, notice that group(0) includes the SL line:
group(0)=BUY HAL 3450 CE ABOVE 70
SL 5
but the value 5 does not appear in any of the capture groups (group(7) is null), which is also puzzling.
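One way to check where the "SL 5" text ends up is to capture the optional fluff group instead of leaving it non-capturing. The sketch below does that on a hypothetical reduced pattern (fixed head, optional fluff line, optional SL line); the group names fluff and sl, the class FluffCaptureSketch and the helper captures are made up for illustration and are not from my real code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FluffCaptureSketch {
    // Hypothetical reduced pattern. The optional fluff group is captured
    // under the made-up name "fluff" so that its content can be inspected.
    static final Pattern REDUCED =
            Pattern.compile("HEAD(?<fluff>\\n.+)?(?:\\nSL (?<sl>\\d+))?");

    // Returns {fluff, sl} as captured by the first find(), or null if nothing is found.
    static String[] captures(String input) {
        Matcher m = REDUCED.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group("fluff"), m.group("sl")};
    }

    public static void main(String[] args) {
        // Input without any fluff line.
        String[] c = captures("HEAD\nSL 5");
        System.out.println("fluff=" + (c[0] == null ? "null" : c[0].replace("\n", "\\n")));
        System.out.println("sl=" + c[1]);
    }
}
```

In this sketch the fluff group captures "\nSL 5" and the sl group is null, which would be consistent with the SL line showing up in group(0) but not in group(7) in my output.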
I have also tried to skip the fluff text using a non-greedy wildcard match,
(?:\n.+?)?
but it made no difference. Can someone please shed some light on why find() doesn't match the remaining text when the group that skips the fluff text is only optional?
Update: I changed the code as suggested so that the entire fluff group (?:\n.+)? is non-greedy (it is now (?:\n+.+?)??), but the problem persists.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Deleteme {
    public static void main(String[] args) {
        // Both optional fluff groups are now lazy: (?:...)?? with .+? inside.
        Pattern pattern = Pattern.compile(
                "BUY ([A-Z ]+) (\\d*\\.\\d+|\\d+) (CE|PE) (AT|ABOVE) (\\d*\\.\\d+|\\d+)(?:-(\\d*\\.\\d+|\\d+))?" +
                "(?:\\n+.+?)??(?:[\\n|\\s]+SL (\\d*\\.\\d+|\\d+)[\\n|\\s]+TGT ([\\d\\.,+]+))?(?:(?:[\\n|\\s]+.+?)??\\n+(January|February|March|April|May|June|July|August|September|October|November|December) EXPIRY)?",
                Pattern.CASE_INSENSITIVE);
        String msg = "BUY HAL 3450 CE ABOVE 70\n" +
                "Currently trading at 65\n" +
                "SL 5\n" +
                "TGT 10,11,13,15++++";
        System.out.println("Result from find():");
        Matcher matcher = pattern.matcher(msg);
        if (matcher.find()) {
            printGroups(matcher);
        }
        System.out.println("\nResult from matches():");
        matcher = pattern.matcher(msg);
        if (matcher.matches()) {
            printGroups(matcher);
        }
    }

    private static void printGroups(Matcher matcher) {
        System.out.println("group(0)=" + matcher.group(0));
        // Note: '<' stops one short of groupCount(), so the last group (9, the month) is not printed.
        for (int i = 1; i < matcher.groupCount(); i++) {
            System.out.println("group(" + i + " )=" + matcher.group(i));
        }
    }
}
The output is the same as before.
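To compare the greedy and non-greedy variants side by side, here is a small sketch that runs find() with three versions of the fluff group against an input with and without a fluff line. It uses a hypothetical reduced pattern (fixed head, fluff group, optional SL line) rather than my full pattern, and the names LazyVariantsSketch, withFluffGroup and slViaFind are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVariantsSketch {
    // Builds the hypothetical reduced pattern around the given fluff group.
    static Pattern withFluffGroup(String fluffGroup) {
        return Pattern.compile("HEAD" + fluffGroup + "(?:\\nSL (\\d+))?");
    }

    // Value of the SL capture group after find(), or null if it captured nothing.
    static String slViaFind(String fluffGroup, String input) {
        Matcher m = withFluffGroup(fluffGroup).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] fluffGroups = {
            "(?:\\n.+)?",   // greedy group, greedy wildcard
            "(?:\\n.+?)?",  // greedy group, lazy wildcard
            "(?:\\n.+?)??"  // lazy group, lazy wildcard
        };
        String[] inputs = {"HEAD\nSL 5", "HEAD\nFluff text\nSL 5"};
        for (String fluffGroup : fluffGroups) {
            for (String input : inputs) {
                System.out.println(fluffGroup + " on \"" + input.replace("\n", "\\n")
                        + "\" -> SL=" + slViaFind(fluffGroup, input));
            }
        }
    }
}
```

In this sketch none of the three variants captures the SL value for both inputs when only find() is used: the greedy group captures it only when the fluff line is present, the fully lazy group only when it is absent, and the variant with just a lazy wildcard captures it in neither case.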