I need to write a method that extracts the words from a text, lowercased and stripped of everything else (punctuation etc.).
However, I've struggled with the regex pattern for two hours and ran into this problem: the text contains words like "50-year", and with my regex the output is:
-year
instead of the expected
year
But I can't simply remove the dash "-", because there are other hyphenated words that should be kept intact.
Here is the code:
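public List<String> retrieveWordsFromFile() {
    List<String> wordsFromText = new ArrayList<>();
    // Split tokens on whitespace and apostrophes
    scanner.useDelimiter("\\n+|\\s+|'");
    while (scanner.hasNext()) {
        wordsFromText.add(scanner.next()
                .toLowerCase()
                // A lone "s" (left over from splitting e.g. "it's") becomes "is"
                .replaceAll("^s$", "is")
                // Remove everything except lowercase letters and hyphens
                .replaceAll("[^\\p{Lower}\\-]", "")
        );
    }
    // Drop tokens that became empty after cleaning
    wordsFromText.removeIf(word -> word.equals(""));
    return wordsFromText;
}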
So how can I express that everything should be removed except letters and the hyphens inside words that start with a letter? Should this be "merged" into a single regex pattern?
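For illustration, here is a minimal sketch of one direction that might work. It is not a single merged pattern: it adds a third `replaceAll` that strips hyphens left dangling at the start or end of a token once the digits and punctuation are gone. The names `WordCleaner`, `clean` and `retrieveWords` are hypothetical, and the scanner is passed in as a parameter here only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class name, used only for this sketch.
class WordCleaner {

    // Normalizes one raw token into a clean lowercase word (may return "").
    static String clean(String token) {
        return token.toLowerCase()
                // same contraction fix as in the original code
                .replaceAll("^s$", "is")
                // keep only ASCII lowercase letters and hyphens
                .replaceAll("[^\\p{Lower}\\-]", "")
                // assumption: a hyphen at either end is leftover debris, so drop it
                .replaceAll("^-+|-+$", "");
    }

    // Same loop as in the question, but taking the scanner as an argument.
    static List<String> retrieveWords(Scanner scanner) {
        List<String> words = new ArrayList<>();
        scanner.useDelimiter("\\s+|'");
        while (scanner.hasNext()) {
            String word = clean(scanner.next());
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

With this, "50-year" should come out as "year" while "well-known" keeps its hyphen. Repeated hyphens in the middle of a token (such as "a--b") are left untouched by this sketch and would need a separate rule.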