1

Please forgive the lousy title.

I have a list of about 430 static "reserved words", each between 2 and 20 characters long. A process on the application runs occasionally which requires checking (potentially tens of) thousands of strings against this set of reserved words to make sure they are all valid.

Is there a java object that is most ideal for this requirement? One that is sorted perhaps?

user3439515
  • 123
  • 4

1 Answers1

3

If you want to use a standard object, use a HashSet. It has O(1) access time in an ideal case. It can degrade if there are collisions (In theory to O(n)). Since you know the set of strings beforehand, you can play with the load-factor a little bit to minimize collisions. In theory, you can also provide a custom hash function by wrapping the strings in an custom object (that would allow you to optimize the function to the distribution of the strings). But unless your strings are somehow really special, I think it would be an overkill.

If you want to/can use a third-party library: you can either use a trie or a finite state automaton. They will be really fast.

What I would recommend: Use HashSet first as it comes with every java. If you see you need something faster, look for a reasonable implementation of a trie. But I expect you will see that the hashset is fast enough.

Jirka
  • 4,184
  • 30
  • 40
  • An FSA would have to be generated at compile time to have an advantage, wouldn't it? – Marko Topolnik Mar 19 '14 at 20:54
  • Yes. But since you have tens of thousands string, you should benefit from using a fsa even if you compile it at the beginning. – Jirka Mar 19 '14 at 20:58
  • @Jirka-x1 Memory differences aside, speaking only of complexity, a trie with fixed-size child lists (w/ index directly computable from character value, e.g. `ch - 'a'`) would have the same performance characteristics as an FSM and may be more convenient to generate. – Jason C Mar 19 '14 at 20:59
  • 1
    @JasonCn But since a generated FSM is just a single method, it has great cache locality. On the other hand, Java's method size limit impedes this approach. – Marko Topolnik Mar 19 '14 at 21:02
  • 1
    That said I'd go with a `HashSet` for now, since it is readily available. Then, if performance requirements are still not met, consider optimizing further by finding / implementing a trie / fsm that implements `Set` and can be dropped in as a replacement. – Jason C Mar 19 '14 at 21:02
  • @MarkoTopolnik Ah, true, great point. An FSM with transition data in an array can overcome method size limits and still may have better cache performance than a trie; although if characters must be mapped to indices that potentially adds a performance hit, e.g. in the Unicode case that Jirka-x1 is about to mention in his next comment (edit crystal ball!). – Jason C Mar 19 '14 at 21:04
  • 1
    @Jason C You are right, assuming he does not need unicode strings. But even then you get the same complications with fsa. – Jirka Mar 19 '14 at 21:04