
I have a regular expression that selects every word containing all (not just any) of a given set of letters. It works fine in Notepad++.

Regular expression pattern:

^(?=.*B)(?=.*T)(?=.*L).+$

Input text file:

AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB

And the output of the regular expression in Notepad++:

LABAT
BALAT
LATAB

Since it works in Notepad++, I tried the same regular expression in Java, but it simply failed.

Here is my test code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        String patternString = "^(?=.*B)(?=.*T)(?=.*L).+$";

        String dictionary = 
                "AL" + "\n"
                +"BAL" + "\n"
                +"BAK" + "\n"
                +"LABAT" + "\n"
                +"TAL" + "\n"
                +"LAT" + "\n"
                +"BALAT" + "\n"
                +"LA" + "\n"
                +"AB" + "\n"
                +"LATAB" + "\n"
                +"TAB" + "\n";

        Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
        Matcher m = p.matcher(dictionary);
        while(m.find())
        {
            System.out.println("Match: " + m.group());
        }
    }

}

The output is erroneous, as shown below:

Match: AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB

My question is simply: what is the Java-compatible version of this regular expression?

Levent Divilioglu
  • The reason your regex worked in Notepad++ and not in Java is because NPP automatically applies multiline mode to all regexes. If a regex is working for you in NPP and you want to export it to Java, add the MULTILINE flag. (Adding the DOTALL flag, as you did, is equivalent to checking the ". matches newline" box in NPP, and I know you weren't doing that; you would have gotten the same result you did in Java.) – Alan Moore Nov 20 '15 at 17:39

3 Answers


Java-specific answer

In real life, we rarely need to validate lines inside one multiline string, and I see that you are in fact just using the input as an array of test data. The most common scenario is reading input line by line and performing checks on each line. I agree that the solution would look a bit different in Notepad++, but in Java each line should be checked separately.

That said, you should not copy the same approaches on different platforms. What is good in Notepad++ does not have to be good in Java.

I suggest this almost regex-free approach (only String#split() still takes a regex argument):

String dictionary_str = 
        "AL" + "\n"
        +"BAL" + "\n"
        +"BAK" + "\n"
        +"LABAT" + "\n"
        +"TAL" + "\n"
        +"LAT" + "\n"
        +"BALAT" + "\n"
        +"LA" + "\n"
        +"AB" + "\n"
        +"LATAB" + "\n"
        +"TAB" + "\n";
String[] dictionary = dictionary_str.split("\n"); // Split into lines
for (int i=0; i<dictionary.length; i++)   // Iterate through lines
{
    if(dictionary[i].indexOf("B") > -1 && // There must be B
       dictionary[i].indexOf("T") > -1 && // There must be T
       dictionary[i].indexOf("L") > -1)   // There must be L
    {
        System.out.println("Match: " + dictionary[i]); // No need matching, print the whole line
    }
}

See IDEONE demo
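The "read line by line" scenario mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name LineFilterSketch and the helpers containsAllLetters and filter are hypothetical names that do not appear in the original answer, and a StringReader stands in for a real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineFilterSketch {

    // Hypothetical helper: true if 'word' contains every character of 'letters'.
    static boolean containsAllLetters(String word, String letters) {
        for (int i = 0; i < letters.length(); i++) {
            if (word.indexOf(letters.charAt(i)) < 0) {
                return false; // one required letter is missing
            }
        }
        return true;
    }

    // Hypothetical helper: reads the source one line at a time and keeps matching lines.
    static List<String> filter(Reader source, String letters) {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (containsAllLetters(line, letters)) {
                    matches.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        // The StringReader is a stand-in; a FileReader could be passed instead.
        for (String word : filter(new StringReader(input), "BTL")) {
            System.out.println("Match: " + word);
        }
    }
}
```

Passing the required letters as a string keeps the check independent of how many letters there are, which the three hard-coded indexOf calls do not.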

Original regex-based answer

Avoid relying on .* where you can. This construct tends to cause needless backtracking. In this case, you can easily optimize it with negated character classes and possessive quantifiers:

^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)

The regex breakdown:

  • ^ - start of string
  • (?=[^B]*+B) - right at the start of the string, require at least one B, optionally preceded by 0 or more characters other than B
  • (?=[^T]*+T) - still right at the start of the string, require at least one T, optionally preceded by 0 or more characters other than T
  • (?=[^L]*+L) - still right at the start of the string, require at least one L, optionally preceded by 0 or more characters other than L

See Java demo:

String patternString = "^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)";
String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT", "BALAT", "LA", "AB", "LATAB", "TAB"};
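Pattern p = Pattern.compile(patternString); // Compile once, reuse for every word
for (int i=0; i<dictionary.length; i++)
{
    Matcher m = p.matcher(dictionary[i]);
    if(m.find()) // The lookaheads only need to succeed at the start of the word
    {
        System.out.println("Match: " + dictionary[i]);
    }
}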

Output:

Match: LABAT
Match: BALAT
Match: LATAB
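The pattern above is applied to one word at a time. If it were run against the original newline-separated string instead, [^B] would also match line breaks, so a lookahead could run on into later lines. A possible adaptation, offered only as a sketch, is to exclude \n from each negated class and add the MULTILINE flag. The class name MultilineNegatedClassSketch and the method findWords are hypothetical, and the sketch assumes lines are separated by \n only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineNegatedClassSketch {

    // Each negated class also excludes \n so a lookahead cannot leave the current line.
    // Assumption: the input uses "\n" as its only line separator.
    private static final Pattern WORD = Pattern.compile(
            "^(?=[^B\\n]*+B)(?=[^T\\n]*+T)(?=[^L\\n]*+L).+$",
            Pattern.MULTILINE);

    // Hypothetical helper: collects every line that contains B, T and L.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String dictionary = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        for (String word : findWords(dictionary)) {
            System.out.println("Match: " + word);
        }
    }
}
```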
Wiktor Stribiżew
  • And the output you get is wrong because you are testing it wrong. I am adding the code now. – Wiktor Stribiżew Nov 20 '15 at 14:12
  • Thanks, I'm really waiting for it, but it would be more useful if you also gave an explanation for your answer, not only for me but also for users who get here from Google with the same problem. Thanks for your reply. – Levent Divilioglu Nov 20 '15 at 14:16
  • I hope I added enough details. I did not add any performance testing, it is a known fact that negated character classes perform much better than `.*` (especially in Java). See also [this post: *Does the Java regex library optimize for any characters .\*?*](http://stackoverflow.com/a/33809099/3832970). – Wiktor Stribiżew Nov 20 '15 at 14:27
  • Thank you very much, that's a fair and useful detailed answer. – Levent Divilioglu Nov 20 '15 at 14:41
  • Yeah sure, just done. You could also do the same for the question, if you think it is valuable for other programmers around the web. – Levent Divilioglu Nov 20 '15 at 15:13
  • @LeventDivilioglu: I have, that's why you do not have -1. – Wiktor Stribiżew Nov 20 '15 at 15:17
  • There's some good general advice here, but it doesn't explain why the OP's code didn't work. Try plugging your regexes into *his* code, and you'll get very different results. You "solved" his problem by changing the input from a multiline string to an array of individual words. All he really needed to do was add the `MULTILINE` flag instead of the `DOTALL` flag. – Alan Moore Nov 20 '15 at 18:11
  • @AlanMoore: You made me rethink the current approach. I added another approach to solving the original problem. Thank you. – Wiktor Stribiżew Nov 20 '15 at 18:49

Change your Pattern to:

String patternString = ".*(?=.*B)(?=.*L)(?=.*T).*";

Output

Match: LABAT
Match: BALAT
Match: LATAB
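The answer does not say which flags it compiles the pattern with. Tracing it by hand, it seems to depend on the default behaviour where . does not match a line break: compiled without flags it yields the three words, while with the question's DOTALL flag the first .* can span lines and the whole input comes back as a single match. The sketch below illustrates that reading; the class name DefaultFlagsSketch and the method findWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefaultFlagsSketch {

    static final String PATTERN = ".*(?=.*B)(?=.*L)(?=.*T).*";

    // Hypothetical helper: runs the pattern with the given flags and collects all matches.
    static List<String> findWords(String text, int flags) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile(PATTERN, flags).matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String dictionary = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";

        // Default flags (0): '.' stops at line breaks, so each match stays inside one line.
        for (String word : findWords(dictionary, 0)) {
            System.out.println("Match: " + word);
        }

        // DOTALL: '.' also matches line breaks, so matches are no longer confined to a line.
        System.out.println("Matches with DOTALL: " + findWords(dictionary, Pattern.DOTALL).size());
    }
}
```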
Mena
  • Thanks for the line terminator update. I've updated the Pattern.compile parameter; however, the output is still not valid. Can you check the code again? This time the result brings up all the words, which is still erroneous. – Levent Divilioglu Nov 20 '15 at 14:07
  • @LeventDivilioglu sorry confused the requirements. My new answer is probably more like what you want. – Mena Nov 20 '15 at 14:13

I did not debug your situation, but I think your problem is caused by matching the entire string rather than individual words.

You're matching "AL\nBAL\nBAK\nLABAT\n" plus some more. Of course that string has all the required characters. You can see it in the fact that your output only contains one Match: prefix.

Please have a look at this answer. You need to use Pattern.MULTILINE.
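As a minimal sketch of that suggestion, the question's pattern can be kept unchanged and compiled with Pattern.MULTILINE in place of Pattern.DOTALL, so that ^ and $ anchor at each line and . stays within a line. The class name MultilineFlagSketch and the method findWords are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineFlagSketch {

    // Same pattern as in the question; only the flag differs (MULTILINE, not DOTALL).
    private static final Pattern WORD =
            Pattern.compile("^(?=.*B)(?=.*T)(?=.*L).+$", Pattern.MULTILINE);

    // Hypothetical helper: collects every line that contains B, T and L.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String dictionary = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        for (String word : findWords(dictionary)) {
            System.out.println("Match: " + word);
        }
    }
}
```

With this change each word is reported with its own Match: prefix, as opposed to the single prefix seen in the question's output.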
