
I have a huge dictionary containing around 1.2 million strings. As input I will get a sentence, and for each word of the sentence I need to check whether it is present in the dictionary.

Performance is the highest priority for me, so I would like to keep this dictionary in memory. I want to complete my dictionary lookup in less than a millisecond. How can I achieve this? Is there an existing external API that does this?

Anuj Mehta

3 Answers


So you only need a set of the words from the dictionary, and to check whether it contains all the words of the sentence.

Set<String> dictionaryIndex = new HashSet<>();
Set<String> sentence = new HashSet<>();

if (!dictionaryIndex.containsAll(sentence)) {
    ...
}

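For the per-word check the question asks about, a minimal self-contained sketch could look like this (the sample entries stand in for the real 1.2 million strings):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DictionaryLookup {
    public static void main(String[] args) {
        // In practice, load the ~1.2 million dictionary entries here.
        Set<String> dictionary = new HashSet<>(Arrays.asList("think", "positive", "human"));

        String sentence = "think positive thoughts";
        for (String word : sentence.split("\\s+")) {
            // HashSet#contains is O(1) on average, so even a large
            // in-memory dictionary answers well under a millisecond per word.
            System.out.println(word + " -> " + dictionary.contains(word));
        }
    }
}
```

A `HashSet` of 1.2 million short strings typically fits comfortably in a few hundred megabytes of heap, so the sub-millisecond target is realistic.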
However, if you want to do more, consider a database, perhaps an embedded in-memory one such as H2 or Derby. You can then do more, and a query would be:

SELECT COUNT(*) FROM dictionary WHERE word IN('think', 'positive', 'human')

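With H2 on the classpath, the embedded in-memory variant might be wired up roughly like this (table and column names are illustrative, not from the question):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class InMemoryDbLookup {
    public static void main(String[] args) throws Exception {
        // jdbc:h2:mem: keeps the whole database in memory.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:dict")) {
            conn.createStatement().execute(
                "CREATE TABLE dictionary (word VARCHAR PRIMARY KEY)");
            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO dictionary VALUES (?)")) {
                for (String w : new String[] {"think", "positive", "human"}) {
                    insert.setString(1, w);
                    insert.execute();
                }
            }
            try (PreparedStatement query = conn.prepareStatement(
                     "SELECT COUNT(*) FROM dictionary WHERE word IN (?, ?, ?)")) {
                query.setString(1, "think");
                query.setString(2, "positive");
                query.setString(3, "human");
                try (ResultSet rs = query.executeQuery()) {
                    rs.next();
                    System.out.println(rs.getInt(1)); // number of words found
                }
            }
        }
    }
}
```

The primary-key index on `word` makes the `IN` lookup an index probe rather than a scan.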
You might even consider a spelling library. Such libraries keep a smaller dictionary and use stemming: 'learn' stands for learning, learner, learned, and learns.
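
Stemming here is just the idea of mapping inflected forms back to a base word. A deliberately naive suffix-stripping sketch shows the principle (a real stemmer, e.g. a Porter implementation, is far more careful):

```java
public class NaiveStemmer {
    // Illustrative only: strips a few common English suffixes.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "er", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"learning", "learner", "learned", "learns"}) {
            // All four forms reduce to the single entry "learn".
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```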

Joop Eggen

If you are open to using external APIs, I would suggest Elasticsearch's percolate API. With performance being the priority, it fits your requirement exactly.

The Java API can be found here.

You can index all the keywords and then provide it a document (in your case, the sentence).

Indexing:

for (String obj : keywordLst) {
    client.prepareIndex("myindex", ".percolator", obj)
            .setSource(XContentFactory.jsonBuilder()
                    .startObject()
                        .field("query", QueryBuilders.matchPhraseQuery("content", obj))
                    .endObject())
            .setRefresh(true)
            .execute().actionGet();
}

Searching:

XContentBuilder docBuilder = XContentFactory.jsonBuilder().startObject();
docBuilder.field("doc").startObject(); 
docBuilder.field("content", text);
docBuilder.endObject(); //End of the doc field
docBuilder.endObject(); //End of the JSON root object

PercolateResponse response = client.preparePercolate().setSource(docBuilder)
            .setIndices("myindex").setDocumentType("type")
            .execute().actionGet();


for (PercolateResponse.Match match : response) {
    // found matches
}
pratZ

I think the 1.2 million strings may not fit in memory, or could easily exceed your memory limit (consider a bad case where the average string length is 256).

If some kind of pre-processing is allowed, I think you'd better first reduce the sequence of strings into a set of words. That is, first convert your data into another data set that easily fits in memory without overflowing it.

After that, you can rely on in-memory data structures such as HashMap.
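
The pre-processing step described above might look like the following sketch (the sample strings and names are illustrative): split each raw string into words once, and keep only the distinct words in a set.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {
    public static void main(String[] args) {
        // Stand-in for the 1.2 million raw strings.
        List<String> rawStrings = List.of("New York", "New Delhi", "San Francisco");

        // One-time pre-processing: split every string into its words.
        Set<String> wordDictionary = new HashSet<>();
        for (String s : rawStrings) {
            for (String word : s.toLowerCase().split("\\s+")) {
                wordDictionary.add(word);
            }
        }

        // Lookups afterwards are plain O(1) set-membership tests.
        System.out.println(wordDictionary.contains("delhi"));
        System.out.println(wordDictionary.contains("paris"));
    }
}
```

Because duplicates collapse into one entry, the word set is usually much smaller than the original list of strings.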

Byungjoon Lee
  • Can you please suggest some techniques for reducing the strings into a sequence of words? In my case the dictionary is names of cities. – Anuj Mehta Jul 29 '14 at 11:21
  • I mean, since your job is to check whether each word of the input sentence is among the 1.2 million strings, I think you'd better split the 1.2 million strings to build another 'dictionary'. Simply pre-process all the strings, splitting them into words and saving those words into the dictionary, which can be a Map or a Set, whatever. After that, you can find out whether the words of a given sentence appear in the 1.2 million strings just by looking them up in that dictionary. – Byungjoon Lee Jul 29 '14 at 23:59