4

I have an array of strings String[] words and a 28000 word Word-list.

I want to check if any member of the String array is in the WordList (the word-list is in a text file wordlist.txt)

What is the most efficient way to go about this?

Eduardo
  • 6,900
  • 17
  • 77
  • 121

8 Answers8

9

Place the strings directly into a HashSet<String> rather than an array and iterate through the file using contains on the set to check the content. You wont improve on O(1) access. This will also mimimize memory used to store the Strings should any duplicates exist.

Reimeus
  • 158,255
  • 15
  • 216
  • 276
  • So its not a good idea to read the whole wordlist into a HashSet aswell then? – Eduardo Sep 06 '13 at 13:39
  • No, you'll want to remove duplicates to reduce memory footprint so using a HashSet directly is the way (plus you wont have to repopulate the HashSet - tiny added win in terms of code use also!) – Reimeus Sep 06 '13 at 13:55
2

You can try the array (tree) suffix algorithm, but you need to implement, look this:

Longest palindrome in a string using suffix tree

Community
  • 1
  • 1
Omar Mainegra
  • 4,006
  • 22
  • 30
1

Step1:Don't use string array. Instead of use HashSet.

Step2: Load the file(that is wordlist.txt) contents into another HashSet

Step3:

Set<String> set1 = new HashSet<String>(); //Load the string array into set
    Set<String> set2 = new HashSet<String>(); //load the file contents into set
    for (String str : set1) {
        for (String str2 : set2) {
            if (str.equalsIgnoreCase(str2)) {
                break;
            }
        }
    }
SmartTechie
  • 123
  • 1
  • 10
  • It would be faster to check if set2.contains(str). That is an order 1 operation and one of the main benefits of using sets. It turns this from O(n^2) to O(n) – Bryan Nov 03 '14 at 03:59
0

You can use HashSet<String> or ArrayList<String> which has contains method. It will check if your String is stored or not.
The difference between HashSet and ArrayList is hashset won't allow duplicate value and it will not maintain the order while arraylist allows you duplicates and its an ordered collection. But HashSet is more efficient than arraylist to perform a search operations.

Vimal Bera
  • 10,346
  • 4
  • 25
  • 47
  • 1
    I wouldn't recommend arraylist, that has to check all the items, whereas HashSet has to check .equals() for those that have the same hashCode()... – ppeterka Sep 06 '13 at 12:59
  • I would not call `ArrayList` search efficient for large lists. – kiheru Sep 06 '13 at 12:59
  • I really think using Java's inbuilt data types for such a search is grossly inefficient. – metsburg Sep 06 '13 at 13:01
  • @metsburg so you have a better implementation? ;-) – Philipp Sander Sep 06 '13 at 13:02
  • Thanks for the comments. Yea, HashSet is more efficient than ArrayList. I just showing him 2 ways and its difference – Vimal Bera Sep 06 '13 at 13:03
  • @PhilippSander A trie might qualify as more efficient, especially if the word list is large – kiheru Sep 06 '13 at 13:10
  • @PhilippSander : Yes, a radix tree (or a Patricia trie) would be the best data structure for storing words. Hashset only makes sense in terms of memory usage, but radix tree is better since it avoids the time to compute hash keys. Also, predecessor/successor cannot be implemented using hash set. Moreover, Coherence API seems to provide a memory optimized radix tree implementation: http://docs.oracle.com/cd/E24290_01/coh.371/e22843/com/tangosol/util/BinaryRadixTree.html – metsburg Sep 06 '13 at 15:23
0

Create a HashSet of Strings as

HashSet<String> wordSet = new HashSet<String>(Arrays.asList(words));

And Check for word in HashSet with HashSet.contains(Object o) method where word is the word you want to check if exists.

TheKojuEffect
  • 20,103
  • 19
  • 89
  • 125
0

Store instead of the original words.txt a serialized HashSet. As a separate step from running the application.

The application then only needs to load the hash set once.

Joop Eggen
  • 107,315
  • 7
  • 83
  • 138
0

HashSet's add() returns false if the word is already present in the set.

for (String str : words) {
  if (!wordSet.add(str)) {
    System.out.println("The word " + str + " is already contained.");
  }
}

This is a bit more sophisticated and less low-level than contains().

András Hummer
  • 960
  • 1
  • 17
  • 35
0

A HashSet will suffice if your list of words can fit in memory.

If memory size is a concern use a BloomFilter. While it is possible for bloom filter to give the wrong answer, you can tune the probability with which it happens.

Andrejs
  • 26,885
  • 12
  • 107
  • 96