
Issue

I need to check whether each word of a string is spelled correctly by looking it up in a MongoDB collection.

  1. The number of DB queries should be kept to a minimum.
  2. The first word of each sentence is always capitalized, but it may be stored in the dictionary in upper or lower case. So I need a case-sensitive match for every word except the first word of each sentence, which should be matched case-insensitively.

Sample string

This is a simple example. Example. This is another example.

Dictionary structure

Assume there is a dictionary collection like this

{ word: 'this' },
{ word: 'is' },
{ word: 'a' },
{ word: 'example' },
{ word: 'Name' }

In my case, there are 100,000 words in this dictionary. Names are stored capitalized, verbs are stored in lower case, and so on.

Expected result

The words simple and another should be recognized as misspelled, as they do not exist in the DB.

In this case, the array of existing words should be ['This', 'is', 'a', 'example']. This is capitalized because it is the first word of a sentence; in the DB it is stored as lower case this.

My attempt so far (Updated)

const   sentences   = string.replace(/([.?!])\s*(?= [A-Z])/g, '$1|').split('|');
let     search      = [],
        words       = [],
        existing,
        missing;

sentences.forEach(sentence => {
    const   w   = sentence.trim().replace(/[^a-zA-Z0-9äöüÄÖÜß ]/gi, '').split(' ');

    w.forEach((word, index) => {
        const regex = new RegExp(['^', word, '$'].join(''), index === 0 ? 'i' : '');
        search.push(regex);
        words.push(word);
    });
});

existing = Dictionary.find({
    word: { $in: search }
}).map(obj => obj.word);

missing = _.difference(words, existing);

Problem

  1. The case-insensitive matches don't work properly: /^Example$/i does return a result, but existing then contains the original lowercase example from the DB, so Example ends up in the missing array. The case-insensitive search itself works as expected; the result arrays just have a mismatch. I don't know how to solve this.
  2. Can the code be optimized? I'm using two forEach loops and a difference...
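
To illustrate the mismatch in point 1, here is a minimal sketch of one possible way to reconcile the query result with the input words without relying on _.difference. It is not part of the original attempt: the helper name classify and the { word, first } token shape are assumptions made for this example, and found stands for the array returned by the $in query.

```javascript
// Hypothetical helper (not from the original code).
// tokens: [{ word: 'This', first: true }, ...] where `first` marks the
//         first word of a sentence.
// found:  dictionary words returned by the $in query, in their stored case.
function classify(tokens, found) {
    const exact = new Set(found);                            // case-sensitive lookup
    const lower = new Set(found.map(w => w.toLowerCase()));  // case-insensitive lookup
    const existing = [];
    const missing = [];

    tokens.forEach(({ word, first }) => {
        // Only the first word of a sentence may match case-insensitively.
        const ok = exact.has(word) || (first && lower.has(word.toLowerCase()));
        (ok ? existing : missing).push(word);
    });

    return { existing, missing };
}
```

Because each input word is tested against the result (instead of subtracting the result from the input), a sentence-initial Example is kept in its original spelling even though the DB returned example.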
user3142695
  • @Liam. Yes it is. (meteor application). Tag added. – user3142695 Dec 06 '16 at 09:55
  • so, the real problem is the upper/lowercase mismatch ? – Derlin Dec 06 '16 at 10:10
  • for case insensitive $in search, have a look at http://stackoverflow.com/questions/27363000/mongo-in-query-with-case-insensitivity – Derlin Dec 06 '16 at 10:18
  • Compare: http://stackoverflow.com/questions/22931177/mongo-db-sorting-with-case-insensitive (TL;DR mongoDB 3.4+ supports case-insensitive indexes) – Tomalak Dec 06 '16 at 10:29
  • You can have two different indexes on the same field, can't you (I'm just assuming that here) – Tomalak Dec 06 '16 at 10:32
  • Also, consider using more advanced tools than a naive regex with some umlauts to tokenize your input text. (This looks promising: https://github.com/NaturalNode/natural) – Tomalak Dec 06 '16 at 10:52

1 Answer


This is how I would approach this issue:

  • Use a regex to split the text into an array of tokens (the words, plus the spaces and '.' between them).

    var words = para.match(/(.+?)(\b)/g); //this expression is not perfect but will work
    
  • Now load all words from your collection into an array using find(). Let's say that array is named wordsOfColl.

  • Now check whether each word is in the collection and capitalized the way you want:

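    var prevWord = ""; // previous token, used to detect the first word of a sentence
    // lower-cased copy of the dictionary words for case-insensitive lookup
    var lowerColl = wordsOfColl.map(function(w) { return w.toLowerCase(); });

    words.forEach(function(word) {
        if (!/\w/.test(word)) {
            // whitespace/punctuation token: remember it, nothing to look up
            prevWord = word;
            return;
        }
        if (lowerColl.indexOf(word.toLowerCase()) !== -1) {
            if (prevWord.replace(/\s/g, '') === '.') {
                // this is the first word of a sentence
                if (word[0] !== word[0].toUpperCase()) {
                    // not capitalized, so generate error
                }
            }
        } else {
            // not in collection, generate error
        }
        prevWord = word;
    });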
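
For reference, the tokenizing expression from the first step returns the separators as tokens too, not only the words. The following sketch shows what it yields for a short sample; the sample text and the wordsOnly filter are assumptions added for illustration.

```javascript
// Sample input, chosen for this illustration only.
const para = 'This is a simple example. Example.';

// Expression from the first step; match() returns null when nothing matches.
const tokens = para.match(/(.+?)(\b)/g) || [];
// tokens: ['This', ' ', 'is', ' ', 'a', ' ', 'simple', ' ', 'example', '. ', 'Example']
// The trailing '.' is dropped because no word boundary follows it.

// Keep only tokens that contain at least one word character.
const wordsOnly = tokens.filter(t => /\w/.test(t));
// wordsOnly: ['This', 'is', 'a', 'simple', 'example', 'Example']
```

Note that \b and \w only cover ASCII letters, digits, and underscore, so words containing umlauts would be split by this expression.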
    

I haven't tested it, so please let me know in the comments if there's an issue or a requirement of yours that I missed.

Update

Since the author of the question doesn't want to load the whole collection on the client, you can create a method on the server that returns an array of words instead of giving the client access to the collection.
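
A minimal sketch of such a server method, under some assumptions: Dictionary is a Mongo.Collection defined on the server, and both the helper name wordList and the method name 'dictionary.words' are made up for this example. It has not been run against a real Meteor app.

```javascript
// Hypothetical helper: returns every dictionary word as a plain array.
// `collection` is anything with a Meteor-style find(selector, options)
// whose result supports map().
function wordList(collection) {
    return collection
        .find({}, { fields: { word: 1 } })  // fetch only the `word` field
        .map(doc => doc.word);
}

// Assumption: running inside a Meteor server with a `Dictionary` collection.
// The guard keeps the snippet loadable outside Meteor as well.
if (typeof Meteor !== 'undefined' && Meteor.isServer) {
    Meteor.methods({
        'dictionary.words'() {
            return wordList(Dictionary);
        }
    });
}
```

On the client, the array could then be fetched once with Meteor.call('dictionary.words', callback) and used as wordsOfColl in the check above.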

Mukul Jain
  • I don't think this is a good way, because my collection has 100,000 documents. It doesn't make sense to load 100,000 documents just to check a dozen words... – user3142695 Dec 06 '16 at 13:12
  • But it's better than sending each word to the server for checking, or even sending the whole sentence to the server for checking. I know it looks like 100K docs will slow down the client, but they won't. Minimongo (the client's Mongo) can handle that many records if the docs are small, which they are in your case. And you can subscribe to that collection for that route. – Mukul Jain Dec 06 '16 at 13:17
  • Hmm... I don't see why subscribing to 100k documents should be better than sending 10 sentences to the server. Please explain. – user3142695 Dec 06 '16 at 13:22
  • But still... 100,000 words with an average of 10 characters are 10^6 bits, which are 122 kBytes. Just for checking a few sentences :-) – user3142695 Dec 06 '16 at 13:25
  • Well then you can do that logic on server too. This logic is client based. – Mukul Jain Dec 06 '16 at 13:29
  • I'm sorry, but that is completely wrong. Let me explain: As you can see in my post, I'm doing the search for relevant words as a query. That means it is purely database work. I'm sending 10 sentences to the server (assume 100 characters), which would be 0.12 kBytes. As a result the server sends me the missing or existing words, which would be max. 0.12 kBytes. That's all, because everything is done in the DB lookup. So 0.24 kBytes is a huge difference to 122 kBytes. – user3142695 Dec 06 '16 at 20:59