My regex skills are not great at the best of times, and I've been struggling with this for a few hours now. I want to split a sentence into parts which are mostly words, but which can also be numbers with decimals and/or quoted text.

I have a testbed:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class Test {

public static void main(String[] args) {
    String Ihave        = "a.X='Foo123. != was here' and "
                        + " T!= v or "
                        + " cat <> dog and "
                        + " x>-15 and "
                        + " \"Peter and Paul\"=\"Mary\" and "
                        + " y< 15.23 bah ";

    String[] Iwant = {"a.X"
                            , "="
                            , "'Foo123. != was here'"
                            , "and" // accidentally left off in the first Stack Overflow posting
                            , "T"
                            , "!="
                            , "v"
                            , "or"
                            , "cat"
                            , "<>"
                            , "dog"
                            , "and"
                            , "x"
                            , ">"
                            , "-15"
                            , "and"
                            , "\"Peter and Paul\""
                            , "="
                            , "\"Mary\""
                            , "and"
                            , "y"
                            , "<"
                            , "15.23"
                            , "bah"};

    String quotedtext    = "((\"[^\"]*\"|'[^']*'))";
    String nospaces      = "[^\\s]";
    String alphanumerics = "\\w";

    String trythis = quotedtext + "" 
                    +"|(<>)|(!=)" // group these pairs together
                    +"|("+nospaces+alphanumerics+"*)"
                    +"|(-\\.|[0-9])" // quoted blocks are ok - but the rest are individual characters
                    ;

    Pattern regex = Pattern.compile(trythis);
    Matcher regexMatcher = regex.matcher(Ihave);
    int x=0;
    while (regexMatcher.find()) {
        String parsed = regexMatcher.group();
        if ( x<Iwant.length ) {
            if ( Iwant[x].equals(parsed)) {
                System.out.println(parsed);
            }
            else {
                System.out.println(parsed+"                         but not as expected ("+Iwant[x]+")");
            }
        }
        else {
            System.out.println(parsed+"              but not as expected");
        }
        x++;
    } 

    System.out.println("\ndone");
}

}

and when I run it I get the following:

a                         but not as expected (a.X)
.X                         but not as expected (=)
=                         but not as expected ('Foo123. != was here')
'Foo123. != was here'                         but not as expected (and)
and                         but not as expected (T)
T                         but not as expected (!=)
!=                         but not as expected (v)
v                         but not as expected (or)
or                         but not as expected (cat)
cat                         but not as expected (<>)
<>                         but not as expected (dog)
dog                         but not as expected (and)
and                         but not as expected (x)
x                         but not as expected (>)
>                         but not as expected (-15)
-15                         but not as expected (and)
and                         but not as expected ("Peter and Paul")
"Peter and Paul"                         but not as expected (=)
=                         but not as expected ("Mary")
"Mary"                         but not as expected (and)
and                         but not as expected (y)
y                         but not as expected (<)
<                         but not as expected (15.23)
15              but not as expected (bah)
.23              but not as expected
bah              but not as expected

done

Although I'm sceptical about the last part of the pattern, everything looks good except the full stops/decimal points, which are being treated as the start of a separate word. How do I fix this (i.e. how do I get "a.X" and "15.23" to stay together)?

I guess the point here for my regex is that the dot shouldn't be treated as a group break.
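For what it's worth, the split seems to come from `[^\s]\w*`: it takes one non-space character followed by word characters, and `.` is not a word character, so "a" ends before the dot and ".X" starts a new match. Below is a minimal sketch of one possible pattern that allows a dot between word or digit runs. The class name `TokenizerSketch` and the method `tokenize` are made up for illustration, and the choice to always read a leading minus as part of a number is an assumption that fits the test sentence, not a general rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original testbed.
class TokenizerSketch {

    // Alternatives are tried left to right, so the more specific ones come first.
    private static final Pattern TOKEN = Pattern.compile(
          "\"[^\"]*\"|'[^']*'"      // double- or single-quoted text, kept whole
        + "|<>|!="                  // two-character operators
        + "|-?\\d+(?:\\.\\d+)?"     // numbers: optional minus, optional decimal part
        + "|\\w+(?:\\.\\w+)*"       // words, with dots allowed between word runs (a.X)
        + "|\\S");                  // any other single non-space character (=, <, >)

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [x, >, -15, and, y, <, 15.23]
        System.out.println(tokenize("x>-15 and y< 15.23"));
    }
}
```

With this ordering, an input such as "a-5" would come out as "a" and "-5", because the minus is absorbed by the number alternative. If subtraction matters, that alternative would need more context.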

Any help would be most appreciated. Thanks A

user1432181
  • Seems I spoke too soon in my answer, which I've deleted. Somebody already provided a good intro to lexers on Stack Overflow (http://stackoverflow.com/q/17848207/18157). I already voted to close as "too broad", so I can't make this question a dup of that one, but you should definitely check it out. – Jim Garrison Mar 04 '16 at 19:07
  • Why does the expected result not contain the word 'and' that appears in the first line of `Ihave`, but it does contain that word when it appears elsewhere? – FredK Mar 04 '16 at 19:42
  • FredK - sorry, you're right. It should have had the "and". I've updated the code above to fix that. Good spot! – user1432181 Mar 05 '16 at 21:57
  • OK, although this seems to have been confused with a need for language parsing, I'm really just asking why the regex isn't treating the dot as part of the word groups. Perhaps regex can't do this, so I've worked around it in the Java loop instead: [[code]] Pattern regex = Pattern.compile(trythis); Matcher regexMatcher = regex.matcher(Ihave); int x=1; ArrayList<String> results=new ArrayList<String>(); results.add(""); while (regexMatcher.find()) { String parsed = regexMatcher.group(); if ( parsed.startsWith(".") ) { results.add( results.get(x-1)+parsed ); results.set(x-1, ""); } else { results.add(parsed); } x++; } [[/code]] – user1432181 Mar 07 '16 at 10:20
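
The workaround in the last comment can be sketched as a self-contained method. This is only an illustration under a few assumptions: the pattern is copied unchanged from the question, the names `DotMergeSketch` and `tokenize` are hypothetical, and a dot-led match is appended to the previous token in place (rather than leaving an empty placeholder behind). The check that the two matches touch is an addition, so a free-standing " . " is not glued onto the word before it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "merge in the loop" workaround.
class DotMergeSketch {

    // The pattern from the question, unchanged.
    private static final Pattern TOKEN = Pattern.compile(
          "((\"[^\"]*\"|'[^']*'))"
        + "|(<>)|(!=)"
        + "|([^\\s]\\w*)"
        + "|(-\\.|[0-9])");

    static List<String> tokenize(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int previousEnd = -1; // end offset of the previous match, -1 before the first
        while (m.find()) {
            String parsed = m.group();
            // Merge only when the dot-led match starts exactly where the last one ended.
            boolean adjacent = m.start() == previousEnd;
            if (parsed.startsWith(".") && adjacent && !results.isEmpty()) {
                int last = results.size() - 1;
                results.set(last, results.get(last) + parsed);
            } else {
                results.add(parsed);
            }
            previousEnd = m.end();
        }
        return results;
    }

    public static void main(String[] args) {
        // prints [a.X, =, 15.23]
        System.out.println(tokenize("a.X = 15.23"));
    }
}
```

One thing this sketch does not guard against: a dot directly after a quoted block (for example `'foo'.x`) would also be merged into the quoted token.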
