I would like to create a regex so that I can split a string in Java with the following constraints:

Split on any non-word character, except for:
 (a) Characters surrounded by ' '
 (b) Any instance of    :=   >=   <=   <>   ..

So that for the following sample string:

print('*');  x := x - 100

I can get the following result in a String[]:

print
(
'*'
)
;

x

:=

x

-

100

This is the regex I currently have so far:

str.split("\\s+|"+
          "(?=[^\\w'][^']*('[^']*'[^']*)*$)|" +
          "(?<=[^\\w'])(?=[^']*('[^']*'[^']*)*$)|" +
          "(?=('[^']*'[^']*)*$)|" +
          "(?<=')(?=[^']*('[^']*'[^']*)*$)");

But this gives me the following result:

print
(
'*'
)
;

x

:    
=    <!-- This is the problem. Should be above next to the :

x

-

100

UPDATE

I have now learned that it's not possible to achieve this using Regex.

However, I still cannot use any external libraries, frameworks, or lexers; I have to use built-in Java classes, such as StringTokenizer.

Dark Knight
  • You cannot do (a) with a regular expression, period. A language with matched delimiter pairs is not a regular language. You need to write/use a proper lexer. – OrangeDog Sep 24 '16 at 21:35
  • can't he use lookback and lookforward in some way? – Gus Sep 24 '16 at 21:36
  • @OrangeDog But it works well with the current regex, however only with one of the two constraints. Is it not possible to add additional regex for constraint `(b)`? – Dark Knight Sep 24 '16 at 21:38
  • @Gus No. For the same reason [you cannot parse html with a regular expression](http://stackoverflow.com/a/1732454/476716). – OrangeDog Sep 24 '16 at 21:38
  • @DarkKnight no it doesn't work well. It just happens to work for your specific example, but it will quickly break down with a more complicated structure of nested quotes. – OrangeDog Sep 24 '16 at 21:39
  • @OrangeDog I see. Is it possible to use any other java method such as `find` or `patterns` to get the wanted result? – Dark Knight Sep 24 '16 at 21:40
  • @DarkKnight [StreamTokenizer](https://docs.oracle.com/javase/8/docs/api/java/io/StreamTokenizer.html) is a good place to start. – OrangeDog Sep 24 '16 at 21:43

1 Answer

Disclaimer: Regex is not a generic parser. If the text you're reading is a complex language, with nested constructs, then you need to use an actual lexer, not a regex. E.g. the code below supports "Characters surrounded by ' '", which is a simple definition, but if the characters can contain escaped ' characters, you'll need a lexer.
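
For reference, the kind of hand-written lexer the disclaimer refers to could be sketched roughly as follows, using only core Java classes. The class name SimpleLexer and its helper methods are hypothetical. The sketch assumes a Pascal-style convention in which a quote inside a literal is written as two quotes (''); the question does not say which escape convention applies, so treat that as an assumption. It also assumes whitespace tokens are not wanted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer; a sketch, not a complete implementation.
class SimpleLexer {
    // Two-character operators that must stay together.
    private static final String[] TWO_CHAR_OPS = {":=", ">=", "<=", "<>", ".."};

    // ASCII word characters, mirroring the default meaning of \w in Java regex.
    private static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    private static boolean isTwoCharOp(String s, int pos) {
        for (String op : TWO_CHAR_OPS) {
            if (s.startsWith(op, pos)) {
                return true;
            }
        }
        return false;
    }

    static List<String> lex(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                      // skip whitespace (assumed unwanted)
                continue;
            }
            int start = i;
            if (isWordChar(c)) {
                while (i < s.length() && isWordChar(s.charAt(i))) {
                    i++;
                }
            } else if (c == '\'') {
                i++;                      // opening quote
                while (i < s.length()) {
                    if (s.charAt(i) == '\'') {
                        if (i + 1 < s.length() && s.charAt(i + 1) == '\'') {
                            i += 2;       // '' is an escaped quote (assumption)
                            continue;
                        }
                        i++;              // closing quote
                        break;
                    }
                    i++;
                }
                // An unterminated literal simply runs to the end of the input.
            } else if (isTwoCharOp(s, i)) {
                i += 2;
            } else {
                i++;                      // any other single character
            }
            tokens.add(s.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : lex("print('it''s');  x := x - 100")) {
            System.out.println(token);
        }
    }
}
```

A scanner like this is longer than a regex, but each branch can be extended independently, for example to report an unterminated literal instead of accepting it.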

Don't use split().

Your code will be much easier to read and understand if you use a find() loop. It'll also perform better.

You write your regex to specify what you want to capture in one iteration of the find() loop. You can rely on | to choose the first pattern that matches, so put more specific patterns first.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\\s+" +    // sequence of whitespace
                           "|\\w+" +    // sequence of word characters
                           "|'[^']*'" + // Characters surrounded by ' '
                           "|[:><]=" +  // :=   >=   <=
                           "|<>" +      // <>
                           "|\\.\\." +  // ..
                           "|.");       // Any single other character
String input = "print('*');  x := x - 100";
for (Matcher m = p.matcher(input); m.find(); )
    System.out.println(m.group());

Output

print
(
'*'
)
;

x

:=

x

-

100
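
Since the question asks for a String[], the same find() loop can collect the tokens instead of printing them. The following is a minimal sketch under two assumptions: the whitespace tokens are not wanted in the result, so the whitespace alternative is dropped and the final catch-all becomes \S, and the class and method names (TokenizeDemo, tokenize) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the find() loop; names are illustrative only.
class TokenizeDemo {
    // Same alternation as in the answer, minus the whitespace branch.
    private static final Pattern TOKEN = Pattern.compile(
            "\\w+"           // sequence of word characters
            + "|'[^']*'"     // characters surrounded by ' '
            + "|[:><]="      // :=   >=   <=
            + "|<>"          // <>
            + "|\\.\\."      // ..
            + "|\\S");       // any single other non-whitespace character

    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // find() scans forward to the next match, so whitespace,
        // which no alternative matches, is skipped.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("print('*');  x := x - 100");
        System.out.println(String.join(" | ", tokens));
        // prints: print | ( | '*' | ) | ; | x | := | x | - | 100
    }
}
```

If the whitespace tokens are wanted after all, keeping the original pattern and collecting m.group() in the same way should give them back.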
Andreas