I am working on a project where I need to scan a folder and search each file for a specific word (say '@MyPattern').

I am looking for the best approach to design this. As a starting point, I have the following:

    //Read File
    List<String> lines = new ArrayList<>();
    try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
        stream.forEach(line-> lines.add(line));
    } catch (IOException e) {
        e.printStackTrace();
    }

    //Create a pattern to find for
    Predicate<String> patternFilter = Pattern
            .compile("@MyPattern^(.+)")
            .asPredicate();

    //Apply predicate filter
    List<String> desiredWordsMatchingPattern = lines
            .stream()
            .filter(patternFilter)
            .collect(Collectors.<String>toList());

    //Perform desired operation
    desiredWordsMatchingPattern.forEach(System.out::println);

I am not sure why this isn't working even though there are multiple words matching '@MyPattern' in the file.

Einstein_AB
  • Suggestion : Just double check your regex once. – Naman Jan 31 '19 at 11:16
  • Looks like an issue in your regex. – Ravindra Ranwala Jan 31 '19 at 11:17
  • My string is like: "@Traces("10869") @Details('User is viewing the user profile') given: The user is open to user profile". I want to extract "10869" after @Traces. What should the regex be for that? – Einstein_AB Jan 31 '19 at 11:21
  • A regex like `@MyPattern` will match `@MyPattern` and nothing else, i.e. it will not match `@Traces` (why should it?). Besides that, your predicate will select lines containing a match, but not extract the match. You could use a `Scanner` for that. – Holger Jan 31 '19 at 12:42
  • @Holger my string was like '@MyPattern("10869")'. Sorry for the typo – Einstein_AB Feb 14 '19 at 17:34
  • Then you should use `Pattern.compile("@MyPattern\\(.+\\)")` or, if you want to capture the contents between the brackets, `Pattern.compile("@MyPattern\\((.+)\\)")`. As said, if you only want the matches, rather than the lines containing the matches, you should not stream over the lines at all, but rather use `Scanner`. E.g., compare with [this answer](https://stackoverflow.com/a/40304028/2711488) – Holger Feb 15 '19 at 09:36

2 Answers

The way you use ^(.+) does not make sense in a regular expression. ^ matches the beginning of the string (line), but the beginning of the string cannot come after the pattern (only if the pattern would match the empty string, which it doesn’t here). So your pattern can never match any line.

Just use:

        Predicate<String> patternFilter = Pattern
                .compile("@MyPattern")
                .asPredicate();

If you require that no characters come after the pattern (not even whitespace), add $, which matches the end of the string:

        Predicate<String> patternFilter = Pattern
                .compile("@MyPattern$")
                .asPredicate();
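
To make the difference concrete, here is a minimal sketch that runs the question's original regex and the two patterns above against sample input. The class name `CaretPlacementDemo` and the sample lines are made up for illustration; they are not from the question.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class CaretPlacementDemo {
    // The question's original regex: '^' sits after the literal text,
    // so the start-of-input anchor can never be satisfied.
    static final Predicate<String> ORIGINAL =
            Pattern.compile("@MyPattern^(.+)").asPredicate();

    // Plain literal: true for any line that contains "@MyPattern".
    static final Predicate<String> CONTAINS =
            Pattern.compile("@MyPattern").asPredicate();

    // Anchored with '$': true only when "@MyPattern" is at the end of the line.
    static final Predicate<String> AT_END =
            Pattern.compile("@MyPattern$").asPredicate();

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from a real file.
        String line = "@MyPattern(\"10869\") given: some text";

        System.out.println(ORIGINAL.test(line));           // false
        System.out.println(CONTAINS.test(line));           // true
        System.out.println(AT_END.test(line));             // false
        System.out.println(AT_END.test("see @MyPattern")); // true
    }
}
```

Note that `asPredicate()` uses `find()` semantics, so `CONTAINS` accepts a line as soon as the text occurs anywhere in it.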
Ole V.V.

Here's my solution:

    // can extract annotation and text-inside-parentheses
    private static final String REGEX = "@(\\w+)\\((.+)\\)";


    //Read File
    List<String> lines = Files.readAllLines(Paths.get(filename));

    //Create a pattern to find for
    Pattern pattern = Pattern.compile(REGEX);

    // extractor function uses pattern's second group (text-within-parentheses)
    Function<String, String> extractOnlyTextWithinParentheses = s -> {
        Matcher m = pattern.matcher(s);
        m.find();
        return m.group(2);
    };

    // all lines are filtered and text will be extracted using extractor-fn
    Stream<String> streamOfExtracted = lines.stream()
            .filter(pattern.asPredicate())
            .map(extractOnlyTextWithinParentheses);

    //Perform desired operation
    streamOfExtracted.forEach(System.out::println);

Explanation:

Let's first clarify what the regex pattern @(\\w+)\\((.+)\\) is meant to do.

Assumption: you are filtering the text for a Java-like annotation such as @MyPattern.

matching specific lines using regular expression

  • @\\w+ matches an at-sign followed by a word (\\w has a special meaning: it stands for a word character, i.e. a letter, digit, or underscore). So it will match any annotation (e.g. @Trace, @User and so on).
  • \\(.+\\) matches some text inside parentheses (e.g. ("10869")). The literal parentheses must be escaped as \\( and \\), and .+ matches any non-empty text inside.

Note: unescaped parentheses have a special meaning inside any regular expression, that is grouping & capturing

To match parentheses and extract their contents, see the answer to "Pattern to extract text between parenthesis".

extracting text using capture groups inside regular expression

Simply use un-escaped parentheses to form a group, and remember each group's number (groups are numbered from left to right, starting at 1). The regex (grouped)(Regex) matches the text groupedRegex and captures two groups:

  • group #1: grouped
  • group #2: Regex

To get these groups, call matcher.find() and then matcher.group() or one of its overloaded methods.
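
As a small illustration of the group numbering described above, the following sketch applies (grouped)(Regex) to the text groupedRegex and reads back each group. The class `CaptureGroupDemo` and its helper `group` are hypothetical names introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Hypothetical helper: returns the text captured by the given group number,
    // or null when the regex is not found in the input at all.
    static String group(String regex, String input, int groupNumber) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(groupNumber) : null;
    }

    public static void main(String[] args) {
        String regex = "(grouped)(Regex)";
        String input = "groupedRegex";

        System.out.println(group(regex, input, 0)); // groupedRegex (group 0 is the whole match)
        System.out.println(group(regex, input, 1)); // grouped
        System.out.println(group(regex, input, 2)); // Regex

        // The same idea with the annotation regex from this answer:
        // group 1 is the annotation name, group 2 the text inside the parentheses.
        String annotationRegex = "@(\\w+)\\((.+)\\)";
        System.out.println(group(annotationRegex, "@MyPattern(\"10869\")", 1)); // MyPattern
        System.out.println(group(annotationRegex, "@MyPattern(\"10869\")", 2)); // "10869"
    }
}
```

Checking the result of `find()` before calling `group(..)` matters: calling `group(..)` after a failed `find()` throws an `IllegalStateException`.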

option to test the regular expression and extraction

In IntelliJ you can use the Check RegExp action: press ALT+Enter on the selected regex to test and adapt it. There are also many websites for testing regular expressions. For example, http://www.regExPlanet.com supports Java regex syntax and lets you verify the extracted groups online. See the example on RegexPlanet.

Note: Besides marking the beginning of the line, as Ole answered above, the caret has a second special meaning: inside a character class it negates the class. [^)]+ means: match anything (at least 1 character) except a closing parenthesis.
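
The negated class makes a practical difference when a line holds more than one annotation, as in the sample line from the question's comments. The sketch below compares the greedy (.+) with ([^)]+). It assumes Java 9 or later for `Matcher.results()`; the class `AnnotationArguments` and its method `arguments` are illustrative names, not part of the code above.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AnnotationArguments {
    // Greedy variant from this answer: .+ runs up to the LAST ')' on the line.
    static final Pattern GREEDY = Pattern.compile("@(\\w+)\\((.+)\\)");

    // Negated character class: stops at the FIRST ')' after the opening one.
    static final Pattern NEGATED = Pattern.compile("@(\\w+)\\(([^)]+)\\)");

    // Collects group 2 (the text inside the parentheses) of every match in the line.
    // Matcher.results() requires Java 9+.
    static List<String> arguments(Pattern pattern, String line) {
        return pattern.matcher(line).results()
                .map(result -> result.group(2))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample line quoted in the comments under the question.
        String line = "@Traces(\"10869\") @Details('User is viewing the user profile') "
                + "given: The user is open to user profile";

        // One match that swallows both annotations:
        // ["10869") @Details('User is viewing the user profile']
        System.out.println(arguments(GREEDY, line));

        // Two separate matches:
        // ["10869", 'User is viewing the user profile']
        System.out.println(arguments(NEGATED, line));
    }
}
```

The negated class still fails if the annotation text itself contains a closing parenthesis, so treat it as a sketch for simple one-line annotations only.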

making it extensible with an extractor function

If you replace the extractor function passed to .map(..) above with the following one, you can print both the annotation name and the text inside the parentheses (tab-separated):

    Function<String, String> extractAnnotationAndTextWithinParentheses = s -> {
        Matcher m = pattern.matcher(s);
        m.find();
        // join all capture groups (annotation name, text inside parentheses) with tabs
        StringBuilder sb = new StringBuilder();
        int lastGroup = m.groupCount();
        for (int i = 1; i <= lastGroup; i++) {
            sb.append(m.group(i));
            if (i < lastGroup) sb.append("\t");
        }
        return sb.toString();
    };

Summary:

Your streaming approach was fine. The problem was in your regular expression:

  • it almost matched a constant annotation, namely @MyPattern
  • you tried to capture using parentheses, which is the right idea
  • there was a syntax error or typo inside the regular expression: the caret ^
  • without the escaped parentheses \\( and \\), the extracted text would have included the parentheses themselves, not only the text inside them

hc_dev