I am writing code that accepts calculator input from a user, and I decided to tokenize the input string with regular expressions. The tokenizing step fails my unit tests for decimal numbers and for "]".

I started by using the lookahead and lookbehind method that I saw here.

My first attempt was "((?<=[+-/*(){^}[%]π])|(?=[+-/*(){^}[%]π]))" (regex1 below), which compiled and ran, but it failed whenever a number contained a decimal point.

I then tried it the way the accepted answer to the linked question does, using "[+-/*\\^%(){}[]]" (regex3 below), both with and without the π, because my first instinct was that π was the character causing the issue. In both cases it resulted in:

Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 41
((?<=[+-/*\^%(){}[]])|(?=[+-/*\^%(){}[]]))

At this point, I went back to my first attempt and rearranged the terms into "((?<=[+-/*^%(){}[]π])|(?=[+-/*^%(){}[]π]))" (regex2 below), but this one threw the same PatternSyntaxException, reported at the last parenthesis.

It is probably easier to show the problem in code, so I wrote a class that runs the three regex attempts:

import java.util.Arrays;
public class RegexProblem {
    /** This delimiter string came from <a href="https://stackoverflow.com/a/2206432/">this answer</a>. */
    public static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";


    // Split on and include + - * / ^ % ( ) [ ] { } π
    public static void main(String[] args) {

        String regex1="((?<=[+-/*(){^}[%]π])|(?=[+-/*(){^}[%]π]))";
        String regex2="((?<=[+-/*^%(){}[]π])|(?=[+-/*^%(){}[]π]))";
        String regex3="[+-/*\\^%(){}[]]";

        String str="1.2+3-4^5*6/(78%9π)+[{0+-1}*2]";
        String str2="[1.2+3]*4";


        String[] expected={"1.2","+","3","-","4","^","5","*","6","(","78","%",
                           "9","π",")","+","[","{","0","+","-","1","}","*","2","]"};
        String[] expected2={"[","1.2","+","3","]","*","4"};


        System.out.println("Expected: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(expected));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(expected2));
        System.out.println();


        System.out.println();
        System.out.println("Regex1: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(str.split(regex1)));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(str2.split(regex1)));
        System.out.println();
        System.out.println("Regex2: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(str.split(regex2)));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(str2.split(regex2)));
        System.out.println();
        System.out.println("Regex3: ");
        System.out.print("str: ");
        System.out.println(Arrays.toString(str.split(String.format(WITH_DELIMITER, regex3))));
        System.out.print("str2: ");
        System.out.println(Arrays.toString(str2.split(String.format(WITH_DELIMITER, regex3))));

    }

}

regex2 and regex3 both fail with the exception above. What baffles me is the behavior of regex1: it runs even though it appears to have the same number of closing brackets as the others, and it splits on "." but not on "]".
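
One way to see what regex1's character class actually contains is to test single characters against it. The sketch below relies on two things that I believe hold for java.util.regex but that are worth checking against your JDK: a `[` inside a class opens a nested class (a union), so in regex1 the `[%]` is a nested class and the `]` characters only close classes, and `+-/` is a range from `+` to `/`, which also covers `,`, `-` and `.`. In regex2 and regex3 the `]` directly after the inner `[` appears to be taken as a literal, which would leave the outer class unclosed. The class name `ClassProbe` and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class ClassProbe {
    // The character class used inside regex1 from the question.
    static final String REGEX1_CLASS = "[+-/*(){^}[%]π]";
    // The character class used inside regex2 from the question.
    static final String REGEX2_CLASS = "[+-/*^%(){}[]π]";

    /** Returns true if the single-character string c is matched by charClass. */
    static boolean inClass(String charClass, String c) {
        return Pattern.matches(charClass, c);
    }

    /** Returns true if regex compiles, false if it throws PatternSyntaxException. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("'.' in [+-/]:    " + inClass("[+-/]", "."));
        System.out.println("'.' in regex1:   " + inClass(REGEX1_CLASS, "."));
        System.out.println("']' in regex1:   " + inClass(REGEX1_CLASS, "]"));
        System.out.println("regex1 compiles: " + compiles(REGEX1_CLASS));
        System.out.println("regex2 compiles: " + compiles(REGEX2_CLASS));
    }
}
```

If this reading is right, the `%` itself is not what made regex1 compile; what matters is that the inner `[` was followed by something other than `]`, so the brackets happened to balance.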

Matthew0898
  • Hyphen (`-`) is a special character when it occurs within square brackets, defining a character range starting with the character on its left, ending with the character on its right. You can disable this functionality by either escaping the hyphen with a backslash, or simply placing it as the leftmost or rightmost character within the square brackets. – CAustin Apr 05 '19 at 20:59
  • @CAustin Thanks! I tried changing it to `regex3="[+\\-/*\\^%(){}[]]";` and commenting out regex2, but I still got `Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 43 ((?<=[+\-/*\^%(){}[]])|(?=[+\-/*\^%(){}[]]))` – Matthew0898 Apr 05 '19 at 21:02
  • The same thing is with square brackets. The first `]` is treated as a closing bracket for characters list. Try escaping this one as well. – Egan Wolf Apr 05 '19 at 21:05
  • It's a bit strange that it's causing an exception, but as written, it's definitely not going to behave the way you expect. You have `[]` within your square brackets, which will cause the character set to terminate on that first right square bracket. – CAustin Apr 05 '19 at 21:07
  • Escaping "[", "-", and "]" solves the issue. The only part that still confuses me is why 1 ran but 2 and 3 didn't run initially. – Matthew0898 Apr 05 '19 at 21:11
  • Would that be because I had the % inside the brackets? – Matthew0898 Apr 05 '19 at 21:12
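
As a sketch of the fix discussed in the comments above, the class below escapes `-`, `[` and `]` inside the character class and plugs the result into the `WITH_DELIMITER` template from the question. The class name `EscapedSplit` and the constant `OPERATORS` are hypothetical, the doubled backslashes are Java string escapes for single regex backslashes, and the sketch assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading token.

```java
import java.util.Arrays;

final class EscapedSplit {
    // Lookbehind/lookahead template taken from the question.
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
    // Hypothetical name: the operator class with '-', '[' and ']' escaped.
    // '^' is literal here because it is not the first character in the class.
    static final String OPERATORS = "[+\\-/*^%(){}\\[\\]π]";

    /** Splits before and after every operator character, keeping the operators. */
    static String[] tokenize(String input) {
        return input.split(String.format(WITH_DELIMITER, OPERATORS));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("1.2+3-4^5*6/(78%9π)+[{0+-1}*2]")));
        System.out.println(Arrays.toString(tokenize("[1.2+3]*4")));
    }
}
```

With the hyphen escaped, `+-/` is no longer a range, so `.` is not a delimiter and decimals such as `1.2` should stay in one token.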

1 Answer

Try this:

(?<=[^\d.])|(?=[^\d.])

Explanation:

  • \d is shorthand for [0-9], so any numeral.
  • . within square brackets just matches a literal dot, which appears to always be part of a number in your example input. Therefore, [\d.] is what we'll use to identify number characters.
  • [^\d.] matches a non-number character (a caret ^ at the start negates a character class).
  • (?<=[^\d.]) matches a point that's preceded by a non-number character.
  • The alternative (?=[^\d.]) matches a point that's followed by a non-number character.
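
The pattern explained above can be tried with `String.split` as in the following minimal sketch. The class name `NonNumberSplit` and the constant name are hypothetical, the backslash in `\d` is doubled because it sits inside a Java string literal, and the sketch assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading token.

```java
import java.util.Arrays;

final class NonNumberSplit {
    // Split at every position that is preceded or followed by a character
    // that is neither a digit nor a dot.
    static final String AROUND_NON_NUMBERS = "(?<=[^\\d.])|(?=[^\\d.])";

    /** Returns numbers as whole tokens and every other character as its own token. */
    static String[] tokenize(String input) {
        return input.split(AROUND_NON_NUMBERS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("1.2+3-4^5*6/(78%9π)+[{0+-1}*2]")));
        System.out.println(Arrays.toString(tokenize("[1.2+3]*4")));
    }
}
```

One consequence of this design is that every non-number character becomes its own token, including spaces and each letter of a multi-character name, so input with function names or whitespace would need extra handling.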
CAustin
  • If your first solution is sub-optimal, edit it out of the answer and provide only the best solution. SO is not a forum, answers should be the BEST answer, not a collection of steps you took to arrive at the best answer. – Jim Garrison Apr 05 '19 at 21:55
  • Operators are non-numbers... I don't know how I missed that. This is a much better way to go about it. – Matthew0898 Apr 06 '19 at 05:09