
I have a huge dictionary containing around 1.2 million strings. As input I will get a sentence, and for each word of the sentence I need to check whether it is present in the dictionary.

Performance is the highest priority for me, so I would like to keep this dictionary in memory. I want to complete my dictionary lookup in less than a millisecond. How can I achieve this? Is there an existing external API that does this?

Anuj Mehta

3 Answers


So you only need a set of the words from the dictionary, and to check whether it contains all the words of the sentence.

Set<String> dictionaryIndex = new HashSet<>();
Set<String> sentence = new HashSet<>();

if (!dictionaryIndex.containsAll(sentence)) {
    ...
}

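For the per-word check the question asks about, a minimal self-contained sketch could look like this (the sample entries stand in for the real 1.2 million strings):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DictionaryLookup {
    public static void main(String[] args) {
        // In practice, load the ~1.2 million dictionary entries here.
        Set<String> dictionary = new HashSet<>(Arrays.asList("think", "positive", "human"));

        String sentence = "think positive thoughts";
        for (String word : sentence.split("\\s+")) {
            // HashSet#contains is O(1) on average, so even a large
            // in-memory dictionary answers well under a millisecond per word.
            System.out.println(word + " -> " + dictionary.contains(word));
        }
    }
}
```

A `HashSet` of 1.2 million short strings typically fits comfortably in a few hundred megabytes of heap, so the sub-millisecond target is realistic.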
However, if you want to do more, consider a database, perhaps an embedded in-memory one such as H2 or Derby. You can then do more, and a query would be:

SELECT COUNT(*) FROM dictionary WHERE word IN('think', 'positive', 'human')

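With H2 on the classpath, the embedded in-memory variant might be wired up roughly like this (table and column names are illustrative, not from the question):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class InMemoryDbLookup {
    public static void main(String[] args) throws Exception {
        // jdbc:h2:mem: keeps the whole database in memory.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:dict")) {
            conn.createStatement().execute(
                "CREATE TABLE dictionary (word VARCHAR PRIMARY KEY)");
            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO dictionary VALUES (?)")) {
                for (String w : new String[] {"think", "positive", "human"}) {
                    insert.setString(1, w);
                    insert.execute();
                }
            }
            try (PreparedStatement query = conn.prepareStatement(
                     "SELECT COUNT(*) FROM dictionary WHERE word IN (?, ?, ?)")) {
                query.setString(1, "think");
                query.setString(2, "positive");
                query.setString(3, "human");
                try (ResultSet rs = query.executeQuery()) {
                    rs.next();
                    System.out.println(rs.getInt(1)); // number of words found
                }
            }
        }
    }
}
```

The primary-key index on `word` makes the `IN` lookup an index probe rather than a scan.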
You might even consider a spelling library. Such libraries keep a smaller dictionary and use stemming: 'learn' stands for learning, learner, learned, and learns.
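
Stemming here is just the idea of mapping inflected forms back to a base word. A deliberately naive suffix-stripping sketch shows the principle (a real stemmer, e.g. a Porter implementation, is far more careful):

```java
public class NaiveStemmer {
    // Illustrative only: strips a few common English suffixes.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "er", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"learning", "learner", "learned", "learns"}) {
            // All four forms reduce to the single entry "learn".
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```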

Joop Eggen

If you are open to using external APIs, I would suggest Elasticsearch's percolate API. With performance being the priority, it fits your requirement exactly.

The Java API can be found here.

You can index all the keywords and then provide it a document (in your case, the sentence).

Indexing:

for (String obj : keywordLst) {
    client.prepareIndex("myindex", ".percolator", obj)
            .setSource(XContentFactory.jsonBuilder()
                    .startObject()
                        .field("query", QueryBuilders.matchPhraseQuery("content", obj))
                    .endObject())
            .setRefresh(true)
            .execute().actionGet();
}

Searching:

XContentBuilder docBuilder = XContentFactory.jsonBuilder().startObject();
docBuilder.field("doc").startObject(); 
docBuilder.field("content", text);
docBuilder.endObject(); //End of the doc field
docBuilder.endObject(); //End of the JSON root object

PercolateResponse response = client.preparePercolate().setSource(docBuilder)
            .setIndices("myindex").setDocumentType("type")
            .execute().actionGet();


for (PercolateResponse.Match match : response) {
    // found matches
}
pratZ

I think the 1.2 million strings may not fit in memory, or could easily exceed your memory limit (consider a bad case where the average string length is 256).

If some kind of pre-processing is allowed, I think you'd better first reduce the sequence of strings into a set of words. That is, first convert your data into another data set that easily fits in memory without overflowing it.

After that, you can rely on in-memory data structures such as HashMap.
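
The pre-processing step described above might look like the following sketch (the sample strings and names are illustrative): split each raw string into words once, and keep only the distinct words in a set.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {
    public static void main(String[] args) {
        // Stand-in for the 1.2 million raw strings.
        List<String> rawStrings = List.of("New York", "New Delhi", "San Francisco");

        // One-time pre-processing: split every string into its words.
        Set<String> wordDictionary = new HashSet<>();
        for (String s : rawStrings) {
            for (String word : s.toLowerCase().split("\\s+")) {
                wordDictionary.add(word);
            }
        }

        // Lookups afterwards are plain O(1) set-membership tests.
        System.out.println(wordDictionary.contains("delhi"));
        System.out.println(wordDictionary.contains("paris"));
    }
}
```

Because duplicates collapse into one entry, the word set is usually much smaller than the original list of strings.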

Byungjoon Lee
  • Can you please suggest some techniques for reducing the strings into a sequence of words? In my case the dictionary is names of cities. – Anuj Mehta Jul 29 '14 at 11:21
  • I mean, since your job is to check whether each word of the input sentence is among the 1.2 million strings, I think you'd better split the 1.2 million strings to build another 'dictionary'. Simply pre-process all the strings, splitting them into words and saving those words into the dictionary, which can be a Map or a Set, whatever. After that, you can find out whether the words of a given sentence appear in the 1.2 million strings just by looking them up in that dictionary. – Byungjoon Lee Jul 29 '14 at 23:59