I have a text file containing a list of words (about 35 MB of data). I wrote an application that works pretty much like a Scrabble helper. I find it unacceptable to load the whole file into a set, since that takes about 10 minutes. I am not very experienced in C++, so I want to ask what a better way to achieve this would be.

In the first version of the application I just binary searched through the file itself (without loading it, just moving the file pointer using seekg). That solved the memory problem, but it isn't as fast as using maps of maps: when searching for a word, I look up its first letter in a map. Then I retrieve a map of possible second letters and do another search (for the second letter), and so on. That way I can tell whether the word is in the dictionary much faster. How can I achieve this without loading the whole file into the program to build these maps? Can I save them in a database and read them back? Would that be faster?
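A minimal sketch of the maps-of-maps lookup I described (the `Node` type and function names are illustrative, not my actual code):

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical trie-like node: each node maps a letter to the next node.
// (unique_ptr indirection, since std::map needs a complete value type.)
struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    bool isWord = false;
};

// Insert a word by walking/creating one node per letter.
void insert(Node& root, const std::string& word) {
    Node* n = &root;
    for (char c : word) {
        auto& child = n->children[c];
        if (!child) child = std::make_unique<Node>();
        n = child.get();
    }
    n->isWord = true;
}

// Look up a word letter by letter; fail as soon as a letter is missing.
bool contains(const Node& root, const std::string& word) {
    const Node* n = &root;
    for (char c : word) {
        auto it = n->children.find(c);
        if (it == n->children.end()) return false;
        n = it->second.get();
    }
    return n->isWord;
}
```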
-
Are you asking how to build this data structure w/o reading the whole file? I don't think that will work. – Scott Hunter Jun 06 '15 at 10:10
-
Seems like you're reinventing a file indexing scheme... – Christophe Jun 06 '15 at 10:11
-
Then how do applications that work on huge dictionaries do it? Suppose I wanted to apply the same logic to 3 or 4 languages; I would probably run out of memory when trying to load 10 million words. – bottaio Jun 06 '15 at 10:12
-
It's perfectly possible. You just have to serialize the map content and reload the serialized data (see the sketch after the comments). Seekg positions are valid as long as you don't change the file (or the OS). – Christophe Jun 06 '15 at 10:16
-
So you suggest saving the whole structure to a file and loading it back as a binary file? – bottaio Jun 06 '15 at 10:17
-
There is no way that loading a dictionary containing 35 megabytes' worth of words into a set should take so long, and I'm saying that as someone working on an i3 with 4 GB of RAM. That should not even take a second. I would first check your build settings and make sure you are actually testing a release build, but 10 minutes sounds horrible even for a debug build. So above all, I'd post the code that loads the data into the set, since it sounds like there's a problem there that has nothing to do with the efficiency of std::set. – Jun 07 '15 at 22:43
-
That said, as others mentioned, tries can outperform std::set in this particular kind of case (personally benchmarked). But that's probably worth thinking about later, after we figure out what the original loader code did so strangely that loading 35 MB of words into a set takes 10 minutes. Even 10 seconds here would be abnormally slow. I wrote such code over 10 years ago that had to load _gigabytes_ of text into std::sets and maps on average hardware of the time, and it still did not take anywhere close to 10 minutes (~15 secs back then for gigabytes of text). – Jun 07 '15 at 22:45
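A rough illustration of Christophe's serialization suggestion, assuming the hypothetical `Node` type from the sketch in the question (the binary format here is made up, just a depth-first dump of the trie):

```cpp
#include <cstdint>
#include <fstream>
#include <memory>

// Depth-first binary format: [isWord byte][child count byte][(letter, subtree)...]
void save(const Node& n, std::ostream& out) {
    out.put(n.isWord ? 1 : 0);
    // One byte suffices for the child count of a letter-keyed node (~26 max).
    out.put(static_cast<char>(n.children.size()));
    for (const auto& kv : n.children) {
        out.put(kv.first);
        save(*kv.second, out);
    }
}

void load(Node& n, std::istream& in) {
    n.isWord = in.get() != 0;
    int count = in.get();
    for (int i = 0; i < count; ++i) {
        char c = static_cast<char>(in.get());
        n.children[c] = std::make_unique<Node>();
        load(*n.children[c], in);
    }
}
```

Loading this format is a single sequential pass over the file, which should be far faster than rebuilding the maps word by word.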
2 Answers
35 MB of data is tiny. There's no problem with loading it all into memory, and no reason for it to take 10 minutes. If it takes that long, I suspect your loading scheme repeatedly copies maps.
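For instance, a straightforward loader like this sketch (the file name is hypothetical) should finish in a second or two on data of that size:

```cpp
#include <fstream>
#include <set>
#include <string>

int main() {
    std::ifstream in("words.txt");      // hypothetical dictionary file
    std::set<std::string> dict;
    std::string word;
    while (in >> word)
        dict.insert(std::move(word));   // moves each word; never copies the set
    // dict.count("scrabble") then tells you whether a word is present
}
```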
However, instead of fixing this, or coming up with your own scheme, perhaps you should try something ready-made.
Your description sounds like you could use a database of nested structures. MongoDB, which has a C++ interface, is one possible solution.
For improved efficiency, you could make the scheme a bit fancier. For words of up to, say, 5 letters, you could use a multikey index; beyond that, you could go with a completely nested structure.
Just don't do it yourself. Concentrate on your program logic.

First, I agree with Ami that 35 MB shouldn't in principle take that long to load and store in memory. Could there be a problem with your loading code (for example, accidentally copying maps, causing lots of allocation/deallocation)?
If I understand your intention correctly, you are building a kind of trie structure (a trie, not a tree) using maps of maps, as you described. This can be very nice if kept in memory, but if you want to load only part of the maps into memory, it becomes very difficult (not technically, but in determining which maps to load and which not to). You'd then risk reading much more data from disk than actually needed, although there are some implementations of persistent tries around.
If your intent is to have the indexing scheme on disk, I'd rather advise you to use a traditional B-tree data structure, which is designed to optimize the loading of partial indexes. You can write your own, but there are already a couple of implementations around (see this SO question).
You could also use something like SQLite, which is a lightweight DBMS that you can easily embed in your application.
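As an illustration, a minimal sketch using the SQLite C API (the table schema and file name are assumptions, and error handling is mostly omitted). A prepared statement checks membership without ever loading the dictionary into memory:

```cpp
#include <sqlite3.h>
#include <string>

// Returns true if `word` exists in a hypothetical words(word TEXT PRIMARY KEY) table.
bool inDictionary(sqlite3* db, const std::string& word) {
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT 1 FROM words WHERE word = ?1", -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, word.c_str(), -1, SQLITE_TRANSIENT);
    bool found = (sqlite3_step(stmt) == SQLITE_ROW);  // a row back means the word exists
    sqlite3_finalize(stmt);
    return found;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("dictionary.db", &db) != SQLITE_OK) return 1;  // hypothetical DB file
    bool ok = inDictionary(db, "scrabble");
    sqlite3_close(db);
    return ok ? 0 : 1;
}
```

With a primary-key index on the word column, each lookup is a B-tree search inside SQLite, which is essentially the on-disk index discussed above, already written for you.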
