7

I have a dictionary that consists of words and their phonetic transcriptions. The words are all lower case, so there is not case-sensitive search involved.

The lexicon is really huge, and I need to load it quickly when my application starts. I would prefer reading it without having to read each entry separately.

I guess the way I stored and load it also affects how I keep the lexicon in memory

Thank you for any ideas.

tmighty
  • 10,734
  • 21
  • 104
  • 218
  • 1
    How huge is "really huge"? Do you plan on loading the whole lexicon in your application's memory, or reading it from a file or database? Also, what types of operations will the structure need to do efficiently? Mainly lookup, or enumeration as well? – Taylor Brandstetter May 21 '13 at 15:41
  • Really huge meaning 200.000 words. I would like to load it into memory. I only need to look up words, no writing or displaying. – tmighty May 21 '13 at 17:03
  • Do you need to search with "typos" and wild-chars ? – Martin Perry May 21 '13 at 17:13
  • @MartinPerry No, just a 1:1 lookup. – tmighty May 22 '13 at 08:01

2 Answers2

4

You probably want to store this as a Trie

This is an efficient way of storing a dictionary. Look at the following answers for more information

http://en.wikipedia.org/wiki/Trie

https://stackoverflow.com/questions/296618/what-is-the-most-common-use-of-the-trie-data-structure

Persisting a trie to a file - C

Community
  • 1
  • 1
Salgar
  • 7,687
  • 1
  • 25
  • 39
  • 1
    Note that unless special care is taken, a trie will have fairly significant memory requirements. – danielschemmel May 21 '13 at 15:49
  • 1
    Though when done properly, a trie is the probably most efficient way to store a dictionary, thanks to prefix compression. – Damon May 21 '13 at 17:00
4

A few options come to mind:

  1. You could use sqlite, which uses mmap to map the file to memory, to store the lexicon so only what is accessed gets read. This is probably reasonable fast and reliable as well as the easiest to implement.
  2. You can mmap the file yourself
  3. Use seek operations to move the file pointer through the file without reading the whole thing. This will only help if the lexicon is structured in some way so you can find the right position without reading everything, i.e. it has to be a data structure that allows better than O(n) searching (a Trie usually being a good choice, as suggested by Salgar).
smocking
  • 3,689
  • 18
  • 22
  • Let's say I memory-map the file, and I know at which positions words start (for example: "a" words start at pos 1, "b" words start at pos 93229), how would I structure my file? Do I have to work with fixed lengths or what did you mean by mmapping the file? – tmighty May 21 '13 at 16:57
  • My application is plain C++ code without any third party libraries, and although I love SQLite, I would opt for not using it in this case. – tmighty May 21 '13 at 17:01
  • Combine the two answers, and mmap a trie. – Oktalist May 21 '13 at 17:06
  • If you know exactly where to look: great, that should work in terms of structure (or you can use a trie). `mmap` maps part of the file to memory, so you can access it like an array. Check out the manpage. It also has a good example of how to do it. Be aware that this low-level approach is not for the faint of heart though, because you have to page-align the offset among other things. – smocking May 21 '13 at 17:09
  • @smocking I am not sure how mmap'ing a dictionary would work. Even though I can access the bytes quickly, how would I search for an entry? I would have to iterate through the entire map, I think. That would not really make sense in my opinion. – tmighty May 29 '13 at 17:29
  • @tmighty, in your first comment you said you already know where to start looking, so why would you need to iterate through the entire structure? – smocking May 30 '13 at 17:52