I'm a little bit surprised that nobody came up with the idea of serialization yet. Boost provides great support for such a solution. If I understood you correctly, the problem is that it takes too long to read in your list of words (and put them into a data structure that hopefully provides fast look-up operations), whenever you use your application. Building up such a structure, then saving it into a binary file for later reuse should improve the performance of your application (based on the results presented below).
Here is a piece of code (and a minimal working example, at the same time) that might help you out on this.
#include <chrono>
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/set.hpp>
#include "prettyprint.hpp"
class Dictionary {
public:
Dictionary() = default;
Dictionary(std::string const& file_)
: _file(file_)
{}
inline size_t size() const { return _words.size(); }
void build_wordset()
{
if (!_file.size()) { throw std::runtime_error("No file to read!"); }
std::ifstream infile(_file);
std::string line;
while (std::getline(infile, line)) {
_words.insert(line);
}
}
friend std::ostream& operator<<(std::ostream& os, Dictionary const& d)
{
os << d._words; // cxx-prettyprint used here
return os;
}
int save(std::string const& out_file)
{
std::ofstream ofs(out_file.c_str(), std::ios::binary);
if (ofs.fail()) { return -1; }
boost::archive::binary_oarchive oa(ofs);
oa << _words;
return 0;
}
int load(std::string const& in_file)
{
_words.clear();
std::ifstream ifs(in_file);
if (ifs.fail()) { return -1; }
boost::archive::binary_iarchive ia(ifs);
ia >> _words;
return 0;
}
private:
friend class boost::serialization::access;
template <typename Archive>
void serialize(Archive& ar, const unsigned int version)
{
ar & _words;
}
private:
std::string _file;
std::set<std::string> _words;
};
void create_new_dict()
{
std::string const in_file("words.txt");
std::string const ser_dict("words.set");
Dictionary d(in_file);
auto start = std::chrono::system_clock::now();
d.build_wordset();
auto end = std::chrono::system_clock::now();
auto elapsed =
std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Building up the dictionary took: " << elapsed.count()
<< " (ms)" << std::endl
<< "Size of the dictionary: " << d.size() << std::endl;
d.save(ser_dict);
}
void use_existing_dict()
{
std::string const ser_dict("words.set");
Dictionary d;
auto start = std::chrono::system_clock::now();
d.load(ser_dict);
auto end = std::chrono::system_clock::now();
auto elapsed =
std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Loading in the dictionary took: " << elapsed.count()
<< " (ms)" << std::endl
<< "Size of the dictionary: " << d.size() << std::endl;
}
int main()
{
create_new_dict();
use_existing_dict();
return 0;
}
Sorry for not putting the code into separated files and for the poor design; however, for demonstrating purposes it should be just enough.
Note that I didn't use a map: I just don't see the point of storing a lot of zeros or anything else unnecessarily. AFAIK, a std::set
is backed by the same powerful RB-Tree as std::map
s.
For the data set available here (it contains around 466k words), I got the following results:
Building up the dictionary took: 810 (ms)
Size of the dictionary: 466544
Loading in the dictionary took: 271 (ms)
Size of the dictionary: 466544
Dependencies:
Hope this helps. :) Cheers.