1

I want to code a regex in java. The possible strings for this are:

yyyyyy$
<t>yy<\t>$
<t><t>yyyyy<\t><\t>$
<t><t>y<\t>y<\t><t>yyyyy<\t>yy$

And the strings NOT allowed or possible are:

<t><\t>$ (no “y” in the string)
<t>yy<t><\t>$ (one extra <t> ).

Some Specifications are: There is exactly one $ in any correct string, and this is always the last symbol in the string. The string before the $ must be non-empty, and we call it an expression. An expression is defined recursively as:

  • the letter ‘y’
  • an expression bracketed by <t> and <\t>
  • a sequence of expressions.

The regex I have built is : y+|y*(<t>y+(<t>y*<\t>)*<\t>) Now I am coding this regex in java as: "d+|(d*(<s>d+(<s>d*<\\s>)*<\\s>))$" Code:

private static void checkForPattern(String input) {
            Pattern p = Pattern.compile("  d+    |   (d*(<s>d+(<s>d*<\\s>)*<\\s>))     $");
            //Pattern p= Pattern.compile("d+|d*<s>dd<\\s>$");
            Matcher m = p.matcher(input);
            if (m.matches()) {
                System.out.println("Correct string");
            } else {
                System.out.println("Wrong string");
            }
        }

What is the error in the syntax as it is saying "wrong" on every String that I am parsing.

James Z
  • 12,209
  • 10
  • 24
  • 44
  • 1
    In your examples does `y` represent a number? Can you give some concrete examples that should match? Why is it `t` in the example and `s` in the pattern? – tadman Mar 26 '18 at 15:52
  • Are you trying to do something xml-ish with regex? I mean the "$" constraint can easily be checked apart from the rest and that rest looks somewhat xml-ish to me. – Fildor Mar 26 '18 at 15:53
  • 1
    You can't *really* balance groups in Java's regex engine. Just count the number of occurrences of `` as well as `` and see if they match. Also, count the number of occurrences of `y`. Done, it's that simple. – ctwheels Mar 26 '18 at 15:55
  • 3
    Having pairs (of tags) makes the grammar irregular, thus you can't/shouldn't use regular expressions here. – fardjad Mar 26 '18 at 15:58
  • Write an expression parser. – Andreas Mar 26 '18 at 16:25

2 Answers2

0

I would suggest not using regex for this since Java's regex engine cannot effectively balance the number of <t> vs <\t> occurrences like other regex engines can (i.e. .NET). Even doing this in those engines is fairly complex and there are likely better ways to go about your problem. The code below does just this: It counts the number of occurrences of <t> and ensures the same number of <\t> exists. Similarly, it counts the number of occurrences of y and ensures there's more than 0 instances. The logic for the countOccurrences method was adapted from this answer on the question Occurrences of substring in a string.

See code in use here

class Main {
    public static void main(String[] args) {
        String[] strings = {
            "yyyyyy$",
            "<t>yy<\\t>$",
            "<t><t>yyyyy<\\t><\\t>$",
            "<t><t>y<\\t>y<\\t><t>yyyyy<\\t>yy$",
            "<t><\\t>$",
            "<t>yy<t><\\t>$"
        };

        for(String s : strings) {
            if (countOccurrences("<t>", s) == countOccurrences("<\\t>", s) && countOccurrences("y", s) > 0) {
                System.out.println("Good: " + s);
            } else {
                System.out.println("Bad: " + s);
            }
        }
    }

    private static int countOccurrences(String needle, String haystack) {
        int lastIndex = 0;
        int count = 0;
        while (lastIndex != -1) {
                lastIndex = haystack.indexOf(needle, lastIndex);
                if (lastIndex != -1) {
                    count++;
                    lastIndex += needle.length();
                }
        }
        return count;
    }
}

Result:

Good: yyyyyy$
Good: <t>yy<\t>$
Good: <t><t>yyyyy<\t><\t>$
Good: <t><t>y<\t>y<\t><t>yyyyy<\t>yy$
Bad: <t><\t>$
Bad: <t>yy<t><\t>$
ctwheels
  • 21,901
  • 9
  • 42
  • 77
  • I am doubtful whether my regex is correct or not. I am asked to use $ sign as the terminator of the string. But when I am writing $ at the end of the string it is giving "wrong" and without $ sign, it is accepting the string. Whereas only those strings should be accepted which has a $ at the end of them. – Sidra Jawaid Mar 26 '18 at 17:29
  • @SidraJawaid not sure I understand what you mean, you don't need regex at all to accomplish what you're trying to do. – ctwheels Mar 26 '18 at 17:30
  • Then why do we define Patterns and then match strings? – Sidra Jawaid Mar 26 '18 at 19:59
  • @SidraJawaid I can't tell you why you've done this, but I'm showing you how I would go about solving your problem. There's no need for regex here as it's **not** the best tool for the job. – ctwheels Mar 26 '18 at 20:02
0

After a thorough research and reading, I have concluded that regex for such type of Language cannot be created as it is an infinite Automata (regex for infinite automata cannot be created). So to solve this problem we will have to create the CFG directly. CFG for the above mentioned problem is below:

R --> <t>S<\t>$(1.1 production)
R-->SS$(1.2 production)
R-->y$(1.3 production)
S--><t>S<\t>(2.1 production)
S-->SS(2.2 production)
S-->y(2.3 production)