
The ArrayList "comb" contains Strings of equal length that are variations of some characters. In the worst case this list can contain around 100,000 words. The function checkWord(String str) takes one word as a parameter and checks whether that word is present in the Hashtable "dictionary" (which contains roughly 90,000 words read in from a text file). So the code needs to check which of the words in the list "comb" are present in the Hashtable "dictionary". In the worst case this search takes up to 5 minutes. I want to implement Runnable and parallelise it, but I am unsure how to go about it.

For example, the list comb contains various misspellings of CURMUDGEON plus the correct word itself, 98,415 entries in total: CURMUEGEON, CURMUEGEOH, CURMUEGEOJ, CURMUEGEKN, and so on. Checking whether each of these words is present in the hash table takes 200 seconds, and I want to bring that time down.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Hashtable;

class key implements Runnable {
    public static ArrayList<String> comb;                   // candidate words (misspellings + the correct word)
    public static Hashtable<String, String> dictionary;     // ~90,000 dictionary words read from a text file
    public static ArrayList<String> filtered = new ArrayList<>();  // candidates found in the dictionary

    public static void main(String[] args) throws IOException {
        key obj = new key();
        Thread thread1 = new Thread(obj);
        thread1.start();
    }

    public static boolean checkWord(String str) {
        String toCheck = str.toLowerCase();
        return dictionary.contains(toCheck);
    }

    public void run() {
        for (String x : comb) {
            if (checkWord(x)) {
                filtered.add(x);
            }
        }
    }
}
daipayan
  • Looking up 100,000 words in a HashMap should be of the order of seconds, if that. There's no point in making that multi-threaded. Are you sure `dictionary` is really a hashtable-based data structure? Please provide a [mcve]. – Jon Skeet Feb 16 '17 at 16:36
  • @JonSkeet Thank you, I have edited and updated my question. – daipayan Feb 16 '17 at 17:08
  • 2
    Hmm ... 1) Why is this a `Map` and not a `Set`? What are values of the `Map`? 2) Parallelization comes at the cost of complexity. If the values are irrelevant, we do have better algorithms for [set intersection](http://stackoverflow.com/questions/4642172/computing-set-intersection-in-linear-time). – dhke Feb 16 '17 at 17:15
  • @dhke Thank you! Changing from hashmap to hashset solved the problem! – daipayan Feb 16 '17 at 17:41
  • 1
    @daipayan That's really odd. What runtime environment are you using? Because in OpenJDK and friends, `HashSet` is [implemented on top of `HashMap`](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashSet.java?av=f#105). – dhke Feb 16 '17 at 17:48
  • Well at the moment, that code is incomplete, and nothing ever populates the dictionary. But if you *did* populate both `dictionary` and `comb` with 100,000 words, I would be shocked to see it take 5 minutes. – Jon Skeet Feb 16 '17 at 18:02
  • @dhke I have the same doubt! The time required to do the same operation using Hashtable took 19800 ms and HashSet took 48 ms. I'm running JRE 1.8 from Oracle. Was it because I used Hashtable instead of HashMap? – daipayan Feb 16 '17 at 18:22
  • @daipayan Most probably. [Hashtable is synchronized](https://docs.oracle.com/javase/8/docs/api/java/util/Hashtable.html), so every lookup obtains the monitor lock, which kills performance. Using `Hashtable` in modern code is probably never a good idea. – dhke Feb 16 '17 at 18:47

2 Answers


Hashtable is a legacy JDK 1.0 API class with very strong concurrency guarantees. In particular,

Unlike the new collection implementations, Hashtable is synchronized.

This means that every operation on Hashtable needs to obtain the monitor lock, which is a performance killer for repeated lookups. It is probably best to follow the recommendations given in the JDK javadocs:

If a thread-safe implementation is not needed, it is recommended to use HashMap in place of Hashtable. If a thread-safe highly-concurrent implementation is desired, then it is recommended to use ConcurrentHashMap in place of Hashtable.
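
Since the dictionary values are never used here (and the comments above report that switching to a set fixed the slowdown), a plain HashSet<String> is the simplest replacement. A minimal sketch, where the file name, trimming, and lower-casing are illustrative assumptions rather than part of the original code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class DictionaryLookup {
    public static void main(String[] args) throws IOException {
        // load the word list into a HashSet; "words.txt" is a placeholder path
        Set<String> dictionary = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("words.txt"))) {
            dictionary.add(line.trim().toLowerCase());
        }

        // each lookup is a single unsynchronized hash probe
        System.out.println(dictionary.contains("curmudgeon"));
    }
}

If the set is fully populated before any worker threads are started, read-only lookups from multiple threads are safe without extra locking.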

dhke

To make this efficient, you need multiple Runnables, each testing a different range of the comb list independently, for example:

import java.util.ArrayList;

public class MySearcher implements Runnable {
    private ArrayList<String> list;
    private int startIdx, endIdx;

    public MySearcher(ArrayList<String> list, int startIdx, int endIdx) {
        // copy into object fields
    }

    public void run() {
        // test all values in the list between startIdx and endIdx
        // put results into a data structure; create a method to get/return that data structure
    }
}

Then you can use an ExecutorService to run all your Runnables (for usage, see the Javadoc: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html).
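
A minimal sketch of wiring this together with a fixed thread pool, assuming MySearcher has been filled in as described above (fields copied in the constructor, matches collected during run(), and exposed through a getter, named getFound() here purely for illustration); the thread count and chunk size are untuned assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFilter {
    public static List<String> filter(ArrayList<String> comb) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // split comb into roughly equal index ranges, one MySearcher per range
        int chunk = (comb.size() + threads - 1) / threads;
        List<MySearcher> tasks = new ArrayList<>();
        for (int start = 0; start < comb.size(); start += chunk) {
            MySearcher task = new MySearcher(comb, start, Math.min(start + chunk, comb.size()));
            tasks.add(task);
            pool.execute(task);
        }

        // stop accepting new tasks and wait for the submitted ones to finish
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);

        // merge the per-range results; getFound() is the illustrative getter mentioned above
        List<String> filtered = new ArrayList<>();
        for (MySearcher task : tasks) {
            filtered.addAll(task.getFound());
        }
        return filtered;
    }
}

Because each task gets its own index range and collects matches into its own list, the workers share no mutable state; results are merged only after awaitTermination returns.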

ControlAltDel