My regex is not good at the best of times, and I've been struggling with this for a few hours now. I want to parse a sentence into parts which are mostly words, but which also include operators, numbers with decimals, and quoted text.
I have a testbed:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String Ihave = "a.X='Foo123. != was here' and "
                + " T!= v or "
                + " cat <> dog and "
                + " x>-15 and "
                + " \"Peter and Paul\"=\"Mary\" and "
                + " y< 15.23 bah ";
        String[] Iwant = {"a.X"
                , "="
                , "'Foo123. != was here'"
                , "and" // accidentally left off on the first stackoverflow posting
                , "T"
                , "!="
                , "v"
                , "or"
                , "cat"
                , "<>"
                , "dog"
                , "and"
                , "x"
                , ">"
                , "-15"
                , "and"
                , "\"Peter and Paul\""
                , "="
                , "\"Mary\""
                , "and"
                , "y"
                , "<"
                , "15.23"
                , "bah"};
        String quotedtext = "((\"[^\"]*\"|'[^']*'))";
        String nospaces = "[^\\s]";
        String alphanumerics = "\\w";
        String trythis = quotedtext + ""
                +"|(<>)|(!=)" // group these pairs together
                +"|("+nospaces+alphanumerics+"*)"
                +"|(-\\.|[0-9])" // quoted blocks are ok - but the rest are individual characters
                ;
        Pattern regex = Pattern.compile(trythis);
        Matcher regexMatcher = regex.matcher(Ihave);
        int x=0;
        while (regexMatcher.find()) {
            String parsed = regexMatcher.group();
            if ( x<Iwant.length ) {
                if ( Iwant[x].equals(parsed)) {
                    System.out.println(parsed);
                }
                else {
                    System.out.println(parsed+" but not as expected ("+Iwant[x]+")");
                }
            }
            else {
                System.out.println(parsed+" but not as expected");
            }
            x++;
        }
        System.out.println("\ndone");
    }
}
When I run it I get the following:
a but not as expected (a.X)
.X but not as expected (=)
= but not as expected ('Foo123. != was here')
'Foo123. != was here' but not as expected (and)
and but not as expected (T)
T but not as expected (!=)
!= but not as expected (v)
v but not as expected (or)
or but not as expected (cat)
cat but not as expected (<>)
<> but not as expected (dog)
dog but not as expected (and)
and but not as expected (x)
x but not as expected (>)
> but not as expected (-15)
-15 but not as expected (and)
and but not as expected ("Peter and Paul")
"Peter and Paul" but not as expected (=)
= but not as expected ("Mary")
"Mary" but not as expected (and)
and but not as expected (y)
y but not as expected (<)
< but not as expected (15.23)
15 but not as expected (bah)
.23 but not as expected
bah but not as expected
done
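Narrowing it down, the split seems to come from the `[^\s]\w*` alternative on its own: `\w` only covers letters, digits and underscore, so the match stops at every dot and the dot then starts the next match. Here is a minimal sketch that reproduces just that part (the class name `DotSplitDemo` and the helper `tokens` are names I made up for this snippet, not part of the testbed above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotSplitDemo {
    // Collects every successive match of the regex in the input.
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // One non-space character followed by word characters:
        // '.' is not a word character, so each match ends just before it.
        System.out.println(tokens("[^\\s]\\w*", "a.X 15.23"));
        // prints [a, .X, 15, .23]
    }
}
```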
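Although I'm sceptical about the last part of the pattern, everything looks good except the full stops/decimal points, which are being treated as the start of a separate word. Most of the "but not as expected" lines are only a knock-on effect: once "a.X" comes out as "a" and ".X", every later token is compared against the wrong entry of Iwant. How do I fix this, i.e. how do I get "a.X" and "15.23" to stay together?
I guess the point for my regex is that a dot between word characters or digits shouldn't be treated as a token break.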
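One direction I am considering, as a rough sketch only: list the alternatives from most to least specific and let the word alternative accept embedded dots. The class name `TokenSketch` and the method `tokenize` are hypothetical names for this snippet, and the pattern assumes that a minus sign directly in front of digits is always a sign (so "x-15" would come out as "x" and "-15"), which may not be what is wanted in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenSketch {
    // Alternatives are tried left to right, so the more specific ones come first.
    static final Pattern TOKEN = Pattern.compile(
          "\"[^\"]*\"|'[^']*'"      // quoted text, double or single quotes
        + "|<>|!="                  // two-character operators
        + "|-?\\d+(?:\\.\\d+)?"     // optionally signed number with optional decimals
        + "|\\w+(?:\\.\\w+)*"       // word, allowing embedded dots such as a.X
        + "|\\S");                  // any other single non-space character

    // Returns the successive matches of TOKEN; whitespace is skipped by find().
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a.X='Foo123. != was here' and "
                + " T!= v or "
                + " cat <> dog and "
                + " x>-15 and "
                + " \"Peter and Paul\"=\"Mary\" and "
                + " y< 15.23 bah ";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

With the sample sentence this keeps "a.X" and "15.23" whole, while a dot that is not followed by a word character (as in "Foo.") still falls through to the single-character alternative.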
Any help would be most appreciated. Thanks, A