How can I easily and quickly store a big word database?

Question

As a school project, I am developing a spelling checker in C++. To check whether a word exists, I currently do the following:

1) I found online a .txt file containing all existing English words.
2) My program starts by going through this text file and placing each entry in a map object, for easy access.

The problem with this approach is that when the program starts, step 2) takes approximately 20 seconds. This is not a big deal in itself, but I was wondering whether any of you had an idea for another approach that would make my word database available quickly. For instance, is there a way to store the map object in a file, so that I don't need to rebuild it from the text file every time?

I didn't find any set of all English words distributed in a format other than text. Did you have a particular one in mind? Thanks for your help. – paupaulaz, Jan 27 '18 at 11:39
@FrankS101 Actually, that is what I would like to do, but is there a way to save a map into static memory? – paupaulaz, Jan 27 '18 at 11:42
@Paulo: Your application should still use an SQLite database. You just need to perform a separate step to insert the values from the text file into the database, possibly using an editor like Notepad++ and a regular expression search to turn the lines into SQL INSERT statements. — Christian Hackl, Jan 27 '18 at 11:51

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

If your file with all English words is not dynamic, you can just store it in a static map. To do so, you need to parse your .txt file, something like:

alpha
beta
gamma
...

to convert it to something like this:

static std::map<std::string,int> wordDictionary = {
    { "alpha", 0 },
    { "beta", 0 },
    { "gamma", 0 }
    ... };

You can do that programmatically or simply with find-and-replace in your favourite text editor.

Your .exe is going to be much larger than before, but it will also start much faster than if it read this data from a file.

answered Jan 27 '18 at 12:33

FrankS101

2,112
6
26
40

I am curious. "static (memory/map)" ≈ "hard code", right? – cppBeginner Jan 27 '18 at 12:43
@cppBeginner Exactly. – FrankS101 Jan 27 '18 at 12:45
Thanks, I also found more about "static memory": https://stackoverflow.com/questions/408670/stack-static-and-heap-in-c and https://stackoverflow.com/questions/31477308/is-there-any-difference-between-static-memory-allocation-and-automatic-memory-al – cppBeginner Jan 27 '18 at 13:23

mindthegap · Accepted Answer · 2018-01-28T14:55:56.817

I'm a little bit surprised that nobody came up with the idea of serialization yet. Boost provides great support for such a solution. If I understood you correctly, the problem is that it takes too long to read in your list of words (and put them into a data structure that hopefully provides fast look-up operations), whenever you use your application. Building up such a structure, then saving it into a binary file for later reuse should improve the performance of your application (based on the results presented below).

Here is a piece of code (and a minimal working example, at the same time) that might help you out on this.

#include <chrono>
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/set.hpp> 

#include "prettyprint.hpp"

class Dictionary {
public:
  Dictionary() = default;
  Dictionary(std::string const& file_)
    : _file(file_)
  {}

  inline size_t size() const { return _words.size(); }

  void build_wordset()
  {
    if (_file.empty()) { throw std::runtime_error("No file to read!"); }

    std::ifstream infile(_file);
    if (!infile) { throw std::runtime_error("Cannot open " + _file); }

    std::string line;
    while (std::getline(infile, line)) {
      _words.insert(line);
    }
  }

  friend std::ostream& operator<<(std::ostream& os, Dictionary const& d)
  {
    os << d._words;  // cxx-prettyprint used here
    return os;
  }

  int save(std::string const& out_file) 
  { 
    std::ofstream ofs(out_file.c_str(), std::ios::binary);
    if (ofs.fail()) { return -1; }

    boost::archive::binary_oarchive oa(ofs); 
    oa << _words;
    return 0;
  }

  int load(std::string const& in_file)
  {
    _words.clear();

    std::ifstream ifs(in_file, std::ios::binary);  // binary mode, matching save()
    if (ifs.fail()) { return -1; }

    boost::archive::binary_iarchive ia(ifs);
    ia >> _words;
    return 0;
  }

private:
  friend class boost::serialization::access;

  template <typename Archive>
  void serialize(Archive& ar, const unsigned int version)
  {
    ar & _words;
  }

private:
  std::string           _file;
  std::set<std::string> _words;
};

void create_new_dict()
{
  std::string const in_file("words.txt");
  std::string const ser_dict("words.set");

  Dictionary d(in_file);

  auto start = std::chrono::steady_clock::now();  // monotonic clock for intervals
  d.build_wordset();
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Building up the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;

  d.save(ser_dict);
}

void use_existing_dict()
{
  std::string const ser_dict("words.set");

  Dictionary d;

  auto start = std::chrono::steady_clock::now();  // monotonic clock for intervals
  d.load(ser_dict);
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Loading in the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;
}

int main()
{
  create_new_dict();
  use_existing_dict();
  return 0;
}

Sorry for not putting the code into separate files and for the poor design; however, for demonstration purposes it should be just enough.

Note that I didn't use a map: I just don't see the point of storing a lot of zeros or anything else unnecessarily. AFAIK, std::set is backed by the same kind of red-black tree as std::map in the common standard library implementations.
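
To illustrate why a set is enough: a membership test needs only the key, so no mapped value is required. The sketch below uses a hypothetical free function, is_known_word, which is not part of the Dictionary class above; the same check could just as well be added to the class as a member.

```cpp
#include <set>
#include <string>

// Hypothetical helper: true if `word` is present in the set.
// std::set::find runs in O(log n) comparisons.
bool is_known_word(std::set<std::string> const& words, std::string const& word)
{
  return words.find(word) != words.end();
}
```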

For the data set available here (it contains around 466k words), I got the following results:

Building up the dictionary took: 810 (ms)
Size of the dictionary: 466544
Loading in the dictionary took: 271 (ms)
Size of the dictionary: 466544

Dependencies:

Boost's Serialization component (I used version 1.58).
louisdx/cxx-prettyprint.

Hope this helps. :) Cheers.

Thanks a lot for this @laszlzso. I am going to dive into it, but judging by the compilation time you mentioned, it seems like a top solution! – paupaulaz, Jan 28 '18 at 14:50
You're very welcome. Note that these are not compilation times: the first figure is the time (in milliseconds) it took to build the set of words from the text file; the second is the time it took to deserialize (load from the binary file) the saved set. – mindthegap, Jan 28 '18 at 14:54

Andrew Dwojc · Answer 3 · 2018-01-27T16:18:13.490

First things first. Do not use a map (or a set) for storing a word list. Use a vector of strings, make sure its contents are sorted (I would expect your word list to be sorted already), and then use std::binary_search from the <algorithm> header to check whether a word is in the dictionary.

Although this may still be highly suboptimal (depending on whether your standard library implements the small string optimisation), your load times will improve by at least an order of magnitude. Run a benchmark, and if you want to make it faster still, post another question about the vector of strings.
