I need to write a method that retrieves the words from a text, lowercased and stripped of everything else (punctuation etc.).

But I've struggled with the regex pattern for two hours and ran into the following problem. The text contains words like "50-year", and with my regex the output is:

-year

instead of the expected

year

But I cannot simply remove the dash symbol "-", because there are other hyphenated words that should be kept as they are.

Here is the code:

public List<String> retrieveWordsFromFile() {
    List<String> wordsFromText = new ArrayList<>();

    scanner.useDelimiter("\\n+|\\s+|'");

    while (scanner.hasNext()) {
        wordsFromText.add(scanner.next()
            .toLowerCase()
            .replaceAll("^s$", "is")
            .replaceAll("[^\\p{Lower}\\-]", "")
        );
    }
    wordsFromText.removeIf(word -> word.equals(""));
    return wordsFromText;
}

So how can I express that everything should be removed except plain words and hyphenated words that start with a letter? Should these rules be merged into a single regex?

KennyWood

1 Answer

Use the regex, \\b[\\p{Lower}]+\\-[\\p{Lower}]+\\b|\\b[\\p{Lower}]+\\b

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Test strings
        String[] arr = { "Hello world", "Hello world 123", "HELLO world", "50-year", "stack-overflow" };

        // Define regex pattern
        Pattern pattern = Pattern.compile("\\b[\\p{Lower}]+\\-[\\p{Lower}]+\\b|\\b[\\p{Lower}]+\\b");

        for (String s : arr) {
            // The string to be matched
            Matcher matcher = pattern.matcher(s);

            while (matcher.find()) {
                // Matched string
                String matchedStr = matcher.group();

                // Display the matched string
                System.out.println(matchedStr);
            }
        }
    }
}

Output:

world
world
world
year
stack-overflow

Explanation of regex:

  1. \b specifies a word boundary.
  2. + specifies one or more occurrences of the preceding element.
  3. | specifies alternation (OR).
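
The pattern above allows at most one hyphen per word, so a word like "mother-in-law" (raised in the comments) would come out as "mother-in" and "law". Here is a minimal sketch of one possible generalisation, which is not part of the original answer: keep \b and +, but swap the | alternation for an optional repeated group, (?:-\p{Lower}+)*. The class name HyphenatedWords and the method extractWords are hypothetical, and lowercasing the input first is an assumption carried over from the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenatedWords {
    // A run of lowercase letters, optionally followed by any number of
    // "-letters" parts, delimited by word boundaries on both sides.
    private static final Pattern WORD =
            Pattern.compile("\\b\\p{Lower}+(?:-\\p{Lower}+)*\\b");

    // Hypothetical helper: lowercases the text (as the question's code does)
    // and collects every match of WORD in order of appearance.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text.toLowerCase());
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints: [my, year-old, mother-in-law, said, hello]
        System.out.println(extractWords("My 50-year-old mother-in-law said: Hello!"));
    }
}
```

Tokens where a letter touches a digit, such as "1w23", are still skipped, because there is no word boundary between a digit and a letter.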

This is how you can discard the non-matching text:

public class Main {
    public static void main(String[] args) {
        // Test strings
        String[] arr = { "Hello world", "Hello world 123", "HELLO world", "50-year", "stack-overflow", "HELLO",
                "HELLO WORLD", "&^*%", "hello", "123", "1w23" };

        // Regex pattern
        String regex = ".*?(\\b[\\p{Lower}]+\\-[\\p{Lower}]+\\b|\\b[\\p{Lower}]+\\b).*";

        for (String s : arr) {
            // Replace the string with group(1)
            String str = s.replaceAll(regex, "$1");

            // If the replaced string does not match the regex pattern, replace it with
            // empty string
            s = !str.matches(regex) ? "" : str;

            // Display the replaced string if it is not empty
            if (!s.isEmpty()) {
                System.out.println(s);
            }
        }
    }
}

Output:

world
world
world
year
stack-overflow
hello

Explanation of replacement:

  1. .*? matches any characters reluctantly, i.e. it consumes as few as possible before the next part of the pattern can match.
  2. s.replaceAll(regex, "$1") replaces each match in s with the content of capturing group 1.
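
To see why the reluctant .*? matters, here is a small sketch that compares it with a greedy .* in front of the same capturing group. The class name ReluctantDemo and the helpers firstWord and lastWord are made up for illustration; the core pattern is the one from the answer.

```java
class ReluctantDemo {
    // Core pattern from the answer: a one-hyphen word or a plain lowercase word.
    private static final String CORE =
            "\\b[\\p{Lower}]+\\-[\\p{Lower}]+\\b|\\b[\\p{Lower}]+\\b";

    // Reluctant prefix: .*? consumes as little as possible, so group 1
    // captures the first match in the string.
    static String firstWord(String s) {
        return s.replaceAll(".*?(" + CORE + ").*", "$1");
    }

    // Greedy prefix: .* consumes as much as possible, so group 1 captures
    // the rightmost match reached while backtracking from the end.
    static String lastWord(String s) {
        return s.replaceAll(".*(" + CORE + ").*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(firstWord("Hello world foo")); // world
        System.out.println(lastWord("Hello world foo"));  // foo
        // No lowercase word at all: nothing matches, so the input comes back unchanged.
        System.out.println(firstWord("HELLO"));           // HELLO
    }
}
```

The unchanged "HELLO" case is the reason the demo above checks the result with matches() before printing it.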
Arvind Kumar Avinash
  • Thank you for the reply! Yes, in this case it works perfectly, but what about hyphenated words? For example, if it were "mother-in-law", the output would be "motherinlaw". – KennyWood Jul 13 '20 at 15:08
  • If I understood your question correctly, you wanted `year` from `50-year`, which the solution already does. Are there any other scenarios you need covered? – Arvind Kumar Avinash Jul 13 '20 at 15:11
  • Yes, exactly: "year" instead of "-year". But as I mentioned above, what should I do with hyphenated words? They should remain as they are. – KennyWood Jul 13 '20 at 15:16
  • Got it. Check the updated answer and let me know if it fulfils your requirement. – Arvind Kumar Avinash Jul 13 '20 at 15:22
  • Yes, with matches() it works as it should. Much obliged to you! But do you know how to use your regex pattern with the replaceAll() method the other way round, i.e. replace everything except what your pattern matches? Is it possible? If yes, you're godlike :D <3 – KennyWood Jul 13 '20 at 16:03
  • @KennyWood - I hope the updated answer fulfils your requirements. – Arvind Kumar Avinash Jul 13 '20 at 19:24
  • Wonderful! Thanks a lot for this detailed and professional answer! <3 – KennyWood Jul 13 '20 at 21:32
  • Last question - why is \\p{Lower} in the brackets [ ] in this case? – KennyWood Jul 13 '20 at 21:48
  • You are most welcome. `why is \\p{Lower} in the brackets [ ] in this case?` - It's not mandatory in this case, but I don't see any side effects either, so I've kept it for better readability. Anything inside `[ ]` means `one of`, e.g. `[abc]` means one of `a`, `b` or `c`. – Arvind Kumar Avinash Jul 13 '20 at 22:08
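
The point about brackets in the last comment can be checked with a short sketch. The class name BracketDemo and the helper findAll are hypothetical. The word boundaries are left out on purpose so that only the effect of the brackets shows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketDemo {
    // Hypothetical helper: returns every match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "50-year stack-overflow";
        // A class with a single member behaves like that member on its own.
        System.out.println(findAll("[\\p{Lower}]+", text));    // [year, stack, overflow]
        System.out.println(findAll("\\p{Lower}+", text));      // [year, stack, overflow]
        // Brackets matter once a second member is added, here the hyphen.
        System.out.println(findAll("[\\p{Lower}\\-]+", text)); // [-year, stack-overflow]
    }
}
```

The last line reproduces the "-year" result from the question: a class that simply allows hyphens anywhere also accepts a leading one.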