Java 8 here. I am given a list of blacklisted words/expressions as well as an input string. I need to determine if any of those blacklisted items appears in the input string:

List<String> blacklist = new ArrayList<>();

// populate the blacklist and "normalize" it by removing whitespace and converting to lower case
blacklist.add("Call for info".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Travel".toLowerCase().replaceAll("\\s", ""));
blacklist.add("To be determined".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Meals".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Custom Call".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Custom".toLowerCase().replaceAll("\\s", ""));

// obtain the input string and also "normalize" it
String input = getSomehow().toLowerCase().replaceAll("\\s", "");

// now determine if any blacklisted words/expressions appear inside the input
for(String blItem : blacklist) {
    if (input.contains(blItem)) {
        throw new RuntimeException("IMPOSSSSSSSIBLE!");
    }
}

I thought this was working great until my input string contained the word "Customer" inside of it.

Since custom exists inside customer, the program is throwing an exception. Instead, I want it to be allowed, because "customer" is not a blacklisted word.

So I think the actual logic here is:

  • If the input string contains a blacklist word...
  • ...AND the blacklist word is preceded by either the beginning of the string or a non-alphabetical character (anything other than [a-z])...
  • ...AND the blacklist word is followed by either the end of the string or a non-alphabetical character...
  • ...then throw the exception

I think that would cover all my bases.
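
The rule in the list above could be sketched with regex lookarounds. This is only an illustration: the class and method names are made up, and it assumes that "alphabetical" means the ASCII letters a-z after lower-casing, so digits and punctuation count as separators.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class LetterBoundaryCheck {
    // Hypothetical helper: true if `item` occurs in `input` with no ASCII letter
    // directly before or after it (case-insensitive via lower-casing).
    static boolean containsStandalone(String input, String item) {
        String haystack = input.toLowerCase(Locale.ROOT);
        String needle = item.toLowerCase(Locale.ROOT);
        // (?<![a-z]) : not preceded by a letter (also succeeds at the start of the string)
        // (?![a-z])  : not followed by a letter (also succeeds at the end of the string)
        Pattern p = Pattern.compile("(?<![a-z])" + Pattern.quote(needle) + "(?![a-z])");
        return p.matcher(haystack).find();
    }
}
```

With this sketch, "Per customer request" would not be flagged for "custom", while "A Custom, please" would be.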

Does Java 8 or any (Apache or otherwise) "commons" library have anything that will help me here? For some reason I'm having a hard time wrapping my head around this and making the code look elegant (I'm not sure how to check for the beginning/ending of a string from inside a regex, etc.).

Any ideas?

Stefan Zobel
hotmeatballsoup
  • You should probably use a regex with `\b` (word boundary) instead of `String.contains()` – BlackPearl Jan 07 '20 at 18:21
  • As @BlackPearl has suggested, I would also use an anchored regex. If you end the word "custom" with a "$", it implies that nothing can come after the m in custom. To take it one step further, you can anchor the beginning of custom with a "^" to imply that nothing comes before the c in custom :) – CyberStems Jan 07 '20 at 19:03

1 Answer

You can pre-compile a list of Patterns for the given words.

\b matches a word boundary. Putting a word boundary on both sides of a word makes the regex match only where that word appears as a whole word, not as part of a longer one.
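
As a rough illustration of how \b behaves (the class and method names are made up for this example): it treats letters, digits and the underscore as word characters, so the start or end of the string, whitespace and punctuation all count as boundaries, while a neighbouring digit or underscore does not.

```java
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Hypothetical helper: true if `word` occurs in `input` as a whole word.
    static boolean hasWord(String input, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b")
                .matcher(input)
                .find();
    }

    public static void main(String[] args) {
        System.out.println(hasWord("custom", "custom"));               // true: start and end of string are boundaries
        System.out.println(hasWord("per customer request", "custom")); // false: followed by the letter 'e'
        System.out.println(hasWord("(custom), please", "custom"));     // true: punctuation is a boundary
        System.out.println(hasWord("custom_call", "custom"));          // false: '_' is a word character
    }
}
```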

// "blacklist" is the List<String> from the question
List<Pattern> blackListPatterns =
    blacklist
        .stream()
        .map(word -> Pattern.compile("\\b" + Pattern.quote(word) + "\\b"))
        .collect(Collectors.toList());

Then you can check the input against each Pattern in the list.

If you are sure your words will not contain any regex metacharacters like ( or *, you can create each Pattern directly from the String instead of using Pattern.quote(), which escapes metacharacters.

for (Pattern pattern : blackListPatterns) {
    if (pattern.matcher(input).find()) {
        throw new RuntimeException("IMPOSSSSSSSIBLE!");
    }
}
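
Putting the pieces together, a minimal self-contained sketch might look like the following. The class name BlacklistChecker and its methods are hypothetical. One assumption differs from the question: the normalization here collapses runs of whitespace to a single space instead of deleting them, because \b needs the separators between words to still be present in the input.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class BlacklistChecker {
    private final List<Pattern> patterns;

    BlacklistChecker(List<String> blacklist) {
        // Pre-compile one whole-word pattern per blacklisted item.
        // Pattern.quote keeps any metacharacters in an item literal.
        this.patterns = blacklist.stream()
                .map(BlacklistChecker::normalize)
                .map(item -> Pattern.compile("\\b" + Pattern.quote(item) + "\\b"))
                .collect(Collectors.toList());
    }

    // Assumption: lower-case and collapse whitespace runs to one space,
    // rather than removing whitespace, so word boundaries survive.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    boolean isBlacklisted(String input) {
        String normalized = normalize(input);
        return patterns.stream().anyMatch(p -> p.matcher(normalized).find());
    }
}
```

A caller could then build the checker once and throw only when `isBlacklisted(input)` returns true. With the blacklist from the question, "Per customer request (make sure to follow up)" is allowed, while "Custom order" is rejected.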
BlackPearl
  • Thanks @BlackPearl (+1) I think you're close, but keep in mind that `input` is a string and comes from end user input, so it could absolutely contain punctuation and meta characters. In reality, the `input` might be something like: "_Per customer request (make sure to follow up)_". In that example, no blacklist words were provided, so we want to allow it. Also, just to double check, will this word boundary concept (`\b`) include beginning/ending of string **as well as** non-numeric characters like whitespaces, punctuation, etc.? Thanks again! – hotmeatballsoup Jan 07 '20 at 19:06
  • @hotmeatballsoup perhaps a [tutorial](https://docs.oracle.com/javase/tutorial/essential/regex/) is appropriate? – Abra Jan 07 '20 at 19:56
  • Non-word characters include all characters other than alphanumeric characters (a-z, A-Z and 0-9) and underscore (_). – BlackPearl Jan 08 '20 at 06:31