I have to process a string with the following rules:
- It may or may not start with a series of ' characters.
- It may or may not end with a series of ' characters.
- Whatever is enclosed between the above should be extracted. However, the enclosed string itself may or may not contain a series of ' characters.
For example, I can get the following strings as input:
''''aa''''
''''aa
aa''''
''''aa''bb''cc''''
For the above examples, I would like to extract the following from them (respectively):
aa
aa
aa
aa''bb''cc
I tried the following code in Java:
Pattern p = Pattern.compile("[^']+(.+'*.+)[^']*");
Matcher m = p.matcher("''''aa''bb''cc''''");
while (m.find()) {
    int count = m.groupCount();
    System.out.println("count = " + count);
    // group(0) is the whole match; group(1) is the capturing group
    for (int i = 0; i <= count; i++) {
        System.out.println("-> " + m.group(i));
    }
}
But I get the following output:
count = 1
-> aa''bb''cc''''
-> ''bb''cc''''
Any pointers?
EDIT: Never mind, I was using a * at the end of my regex instead of a +. Making this change gives me the desired output. But I would still welcome any improvements to the regex.
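As a starting point for such improvements, here is a minimal sketch of two possible alternatives. It is only one way to read the rules above, not a definitive answer. The class name QuoteTrimmer and the method names extract and strip are made up for illustration. The first variant skips the leading quotes, captures the middle lazily, and lets the trailing quotes be consumed outside the group. The second simply deletes the leading and trailing runs with replaceAll.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
class QuoteTrimmer {
    // '*    : skip any leading quotes (greedy)
    // (.*?) : lazily capture the enclosed text, which may contain quotes
    // '*    : consume any trailing quotes, up to the end of the input
    private static final Pattern ENCLOSED = Pattern.compile("^'*(.*?)'*$");

    // Variant 1: capture the enclosed text in group 1.
    // Falls back to the unchanged input if the pattern does not match
    // (for example, when the input contains a line break).
    static String extract(String input) {
        Matcher m = ENCLOSED.matcher(input);
        return m.matches() ? m.group(1) : input;
    }

    // Variant 2: remove the leading run and the trailing run of quotes.
    static String strip(String input) {
        return input.replaceAll("^'+|'+$", "");
    }

    public static void main(String[] args) {
        String[] inputs = {"''''aa''''", "''''aa", "aa''''", "''''aa''bb''cc''''"};
        for (String s : inputs) {
            System.out.println(s + " -> " + extract(s) + " / " + strip(s));
        }
    }
}
```

Because the capturing group is lazy and the trailing quotes are matched outside it, the quotes inside the enclosed text (as in aa''bb''cc) are kept while the outer runs are dropped. Both variants assume single-line input.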