2

Regular alternating in regex : (expression1|expression2|expression3) This Is fine for few expressions but what if I have a quite large "word bank" from which I would like to construct the regex, How would you suggest implementing it in a more elegant way?

Daniel Braun
  • 2,452
  • 27
  • 25

3 Answers3

3

Here are 3 possible options

  1. Create and search a dictionary

Have a look here to see how to use dictionaries in this application Java: Check if a string is in a dictionary

  1. Create and search an array

    Arrays.asList(yourArray).contains(yourValue)

See here How can I test if an array contains a certain value?

  1. If the list are all mutations of the same word a regex may work after all. For example if your actual terms are (expression1|expression2|expression3) you could search it with /^expression[1-3]$/

You can experiment with Regex here: https://regex101.com/

Community
  • 1
  • 1
yosefrow
  • 2,128
  • 20
  • 29
2

Indeed a regular expression would be impractical for large word lists.

Consider storing the list of words in a java.util.TreeSet, with case-insensitivity as a final flourish:

java.util.Set<String> set = new java.util.TreeSet<>(String.CASE_INSENSITIVE_ORDER);

and use the contains method on set to test if a word is present. Once the set is constructed, contains will give you back the result in O(log N).

Alternatively you could use a java.util.HashSet<String>, which might improve the lookup performance, but then you need to engineer the case-insensitivity yourself.

Bathsheba
  • 231,907
  • 34
  • 361
  • 483
2

The alternative approach of using a set (either TreeSet or HashSet will do, and without knowing details about the strings, it's impossible to say which one would be faster).

However, I'd like to point out that you can indeed build a regex for this, and if you have the word bank in a collection of some kind, you can build the regex automatically from the word bank. The "easy" way is just OR'ing them with a '|' sign, but that's going to be impractical for large word banks, so you'll need a different approach. I'm not going to write detailed code here since I consider it impractical too to use a regex, so I'll just outline it:

  • take the first element from the word bank, inspect its first character
  • split all words that start with this character off into a separate collection, removing the first character from all of them
  • build a regex recursively for that collection
  • the regex for all words starting with that character is that character (escaped if necessary), followed by the regex for the separated collection. The latter must be put in parentheses to control precedence.
  • the regex for the full set is an OR ('|' in regex syntax) for the regexes for each starting character, including an OR for the empty string if the collection contains the empty string. This is the "leaf" case in the recursion.
Martin Geisse
  • 1,189
  • 1
  • 9
  • 22