
I've got a vocabulary: a, abandon, ..., z.

For some reason, I will use an array rather than a Trie to store them.

Thus a simple method can be: wordA\0wordB\0wordC\0...word\0
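For concreteness, a minimal sketch of walking such a buffer in C (the words and the `packed` name here are made up for illustration):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Words packed back-to-back, each terminated by '\0'.
       The string literal adds one more '\0' at the very end,
       so an empty "word" marks the end of the buffer. */
    const char packed[] = "apple\0banana\0cherry\0";
    const char *p = packed;

    /* Each word starts where the previous one ended. */
    while (*p != '\0') {
        printf("%s\n", p);
        p += strlen(p) + 1;  /* skip the word and its terminator */
    }
    return 0;
}
```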

But I think there are more memory-economical methods.

Since like is a substring of likely, we can store just the position and length of like instead of the string itself. Thus we generate a "large string" which contains every word in the vocabulary, and use position[i] and length[i] to get the i-th word.

For example, the vocabulary contains three words: ab, cd and bc. I construct abcd as the "large string":

position[0] = 0, length[0] = 2

position[1] = 2, length[1] = 2

position[2] = 1, length[2] = 2
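The lookup side is then trivial. A minimal C sketch using the example's data (the names `big`, `position` and `length` are my own, not fixed by the question):

```c
#include <stdio.h>
#include <string.h>

/* "Large string" and per-word (position, length) tables from the example:
   the words "ab", "cd", "bc" all packed into "abcd". */
static const char big[] = "abcd";
static const size_t position[] = { 0, 2, 1 };
static const size_t length[]   = { 2, 2, 2 };

/* Copy the i-th word into out (out must hold length[i]+1 bytes). */
static void get_word(size_t i, char *out) {
    memcpy(out, big + position[i], length[i]);
    out[length[i]] = '\0';
}

int main(void) {
    char buf[16];
    for (size_t i = 0; i < 3; i++) {
        get_word(i, buf);
        printf("word %zu: %s\n", i, buf);
    }
    return 0;
}
```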

So how to generate the "large string" is the key to this problem. Are there any cool suggestions?

I think the problem is similar to the Traveling Salesman Problem (TSP), which is NP-hard.
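Indeed, generating the shortest such "large string" is the shortest common superstring problem, which is NP-hard (hence the TSP resemblance). A common heuristic is greedy merging: repeatedly merge the pair of words with the largest suffix/prefix overlap. A rough sketch, assuming duplicates and words that are substrings of other words have already been filtered out:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Length of the longest suffix of a that is a prefix of b. */
static size_t overlap(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t max = la < lb ? la : lb;
    for (size_t k = max; k > 0; k--)
        if (memcmp(a + la - k, b, k) == 0)
            return k;
    return 0;
}

/* Greedy superstring: repeatedly merge the pair with the biggest overlap.
   Assumes no word is a substring of another; words[] is modified in place,
   and each entry must be individually malloc'ed. */
static char *greedy_superstring(char **words, size_t n) {
    while (n > 1) {
        size_t best = 0, bi = 0, bj = 1;
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                if (i == j) continue;
                size_t k = overlap(words[i], words[j]);
                if (k >= best) { best = k; bi = i; bj = j; }
            }
        /* Merge words[bj] onto words[bi], dropping the overlapping part. */
        size_t li = strlen(words[bi]), lj = strlen(words[bj]);
        char *merged = malloc(li + lj - best + 1);
        memcpy(merged, words[bi], li);
        strcpy(merged + li, words[bj] + best);
        free(words[bi]);
        free(words[bj]);
        words[bi] = merged;
        words[bj] = words[--n];  /* remove slot bj by swapping in the last word */
    }
    return words[0];
}

int main(void) {
    const char *v[] = { "ab", "cd", "bc" };
    char *w[3];
    for (int i = 0; i < 3; i++) {
        w[i] = malloc(strlen(v[i]) + 1);
        strcpy(w[i], v[i]);
    }
    char *s = greedy_superstring(w, 3);
    printf("%s\n", s);  /* prints "abcd" for this vocabulary */
    free(s);
    return 0;
}
```

On the example vocabulary {ab, cd, bc} this produces abcd, matching the hand-built string above. Greedy merging does not guarantee the optimum, but it is simple and usually close.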

  • Any reason not to use an existing database solution? Sure, you can invent your own. But is it worth the effort? – clq Nov 16 '15 at 09:44
  • @clq: I don't think relational DBs compact tables of strings down to a Trie or anything. Correct me if I'm wrong, but a DB table with a single string field as primary key would give you very fast index checks for present/absent, but probably take *more* space than storing them all back-to-back in a C string. – Peter Cordes Nov 16 '15 at 09:53
  • @clq An array is more efficient and easier to handle for a pure C project. – LTzycLT Nov 16 '15 at 09:53
  • @LTzycLT: More *space* efficient, but less time-efficient than using an out-of-the-box database. A DB library will give you good performance without having to spend much dev time on it. – Peter Cordes Nov 16 '15 at 09:54
  • What's wrong with the trie? The structure you are talking about seems very restrictive to me.. or let's put it differently: it can only profit in very specific cases. You give an example where one word is the prefix of another one - "like" and "likely".. But what if they only *share* the prefix? For instance "likes" and "liked". How should your array store those two words efficiently? Well, a trie would be perfect for that I think... – dingalapadum Nov 16 '15 at 10:01
  • Finding an *optimal* solution may well be NP complete, like you guess. Optimal would be nice, but you can always settle for "good enough", like gzip / lzma / lzop / lz4 / bzip / every other compressor does. I agree with @dingalapadum: this compression scheme doesn't appear to have any great advantage. I guess it lets you binary search without decompressing the whole dictionary, but my answer suggests some better-than-Trie data structures that are also efficient for lookup. – Peter Cordes Nov 16 '15 at 10:10
  • @PeterCordes thanks for your advice – LTzycLT Nov 16 '15 at 10:23
  • @PeterCordes I don't even think it lets you binary search. Binary search assumes some kind of ordering. But I don't think you can guarantee such an ordering here. Consider: {yes, day, yesterday} => long word "yesterday" + markers (0,2), (0,8), (6, 8). So now, how can we check efficiently if "day" is in the dict? Or am I missing something? Besides being contiguous in memory I don't see anything useful in this structure. And what good is it "being contiguous" if everything else sucks? Interesting ops usually are: insert, delete and find, for all those ops this thing looks to me as if it sucked. – dingalapadum Nov 16 '15 at 10:23
  • @dingalapadum My vocabulary is already in alphabetic order. – LTzycLT Nov 16 '15 at 10:28
  • @dingalapadum: you can't bsearch in your example because your dictionary isn't sorted. Since you can get the `i`th word in constant time (random access to words), you can binary search if the entries in `pos / length` are in order. There's no advantage to having the pos/length arrays out of order, because every pos takes probably a 16bit `uint16_t`, and every `length` entry takes an 8bit `uint8_t`. – Peter Cordes Nov 16 '15 at 10:31
  • @dingalapadum: I think the point you're missing is that the "long string" doesn't have to contain the words in-order. I.e. `pos[]` entries don't have to be monotonically increasing. – Peter Cordes Nov 16 '15 at 10:50
  • @PeterCordes ah my bad.. I get now what you mean. You sort the pos/length entries with respect to the word they represent.. – dingalapadum Nov 16 '15 at 10:50
  • @PeterCordes Ok. But if I get it right now, then you do a binary search over the words; when the first letters match the word you found, you run a comparison linear in the size of the word... and on a mismatch you continue the binary search, I suppose.. so this actually runs in O(log(number_of_words) * size_of_longest_word), right? – dingalapadum Nov 16 '15 at 11:15
  • @LTzycLT I still wonder what you expect for common prefixes... liked, liken, liker, likes? How would you store these 4 words? – dingalapadum Nov 16 '15 at 11:16
  • @dingalapadum: yup, if N=number of words in the dictionary, and M=length of the word you're searching for, worst-case is O(M log N), while average case is probably just O(log N), because the `memcmp(needle, bigstr[pos[i]], len[i])` will usually find a difference fast. As for common prefixes: the OP never made any claim that this helped with common prefixes. Maybe his dictionary contains a lot of words that are substrings of other words, and *not* a lot of different forms of the same word. Or maybe he just wants to try this idea, even though it probably won't work well for English. – Peter Cordes Nov 16 '15 at 11:38
  • @dingalapadum I can't merge any letters from these four words in this case. So maybe it's not the best way to compress. But the array is the prerequisite data structure and is already decided; I'd like to think about other ways of compressing the words in the array. Anyhow, thanks for your answer. – LTzycLT Nov 16 '15 at 13:05

1 Answer


The search keyword you're looking for is "dictionary": i.e., data structures that can be used to store a list of words, and to test other strings for being present or absent in the dictionary.


Your idea is more compact than storing every word separately, but far less compact than a good data structure like a DAWG. As you note, it isn't obvious how to optimally choose how to overlap your strings. What you're doing is a bit like what a lossless compression scheme (like gzip) would do. If you don't need to check words against your compact dictionary, maybe just use gzip or LZMA to compress a sorted word list. Let their algorithms find the redundancy and represent it compactly.
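If you go that route, a minimal sketch using zlib's one-shot `compress()` API (link with -lz; the word list here is illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    /* A sorted, newline-separated word list (illustrative). */
    const char *words = "ab\nabandon\nlike\nlikely\nyes\nyesterday\n";
    uLong slen = (uLong)strlen(words);

    /* One-shot compression: compressBound() gives a safe output size. */
    uLongf dlen = compressBound(slen);
    Bytef *dest = malloc(dlen);
    if (compress(dest, &dlen, (const Bytef *)words, slen) != Z_OK) {
        fprintf(stderr, "compress failed\n");
        return 1;
    }
    printf("%lu bytes -> %lu bytes\n",
           (unsigned long)slen, (unsigned long)dlen);
    free(dest);
    return 0;
}
```

(For a tiny list like this the output may even be larger than the input; the gains show up on a real dictionary.)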

I looked into dictionaries for a recent SO answer that caught my interest: Memory-constrained external sorting of strings, with duplicates combined&counted, on a critical server (billions of filenames)

For a dictionary that doesn't have to have new words added on the fly, a Directed Acyclic Word Graph (DAWG) is the way to go. You match a string against it by following graph nodes until you either hit a point where there's no edge matching the next character, or you reach the end of your input string and find that the node in the DAWG is marked as a valid end-of-word (rather than merely a prefix of some longer words). There are algorithms for building these state machines from a simple array-of-words dictionary in reasonable time.
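To make the matching loop concrete, here is a hedged sketch; the node layout is hypothetical, and building the graph (the separate, harder part) would come from whichever DAWG-construction algorithm you pick:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical DAWG node: one outgoing edge per letter, plus a flag
   marking nodes where a complete word ends. */
struct dawg_node {
    struct dawg_node *edges[26];  /* edges[c - 'a'], NULL if absent */
    bool is_word;                 /* true if some word ends here */
};

/* Follow edges for each character; accept only at an end-of-word node. */
static bool dawg_contains(const struct dawg_node *root, const char *s) {
    const struct dawg_node *node = root;
    for (; *s != '\0'; s++) {
        if (*s < 'a' || *s > 'z') return false;  /* lowercase words only */
        node = node->edges[*s - 'a'];
        if (node == NULL) return false;          /* no edge: not in dict */
    }
    return node->is_word;  /* end of input: is this a whole word? */
}
```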

Your method can only take advantage of redundancy when a whole word is a substring of another word, or when the end of one word matches the start of another. A DAWG can take advantage of common substrings everywhere, and is also quite fast to match words against. Probably comparable in speed to binary-searching your data structure, especially if the giant string is too big to fit in the cache. (Once you start exceeding cache size, compactness of the data structure starts to outweigh code complexity for speed.)
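To illustrate that binary-search comparison (as worked out in the comments above): only the position/length entries need to be in alphabetical word order, not the string data itself. A sketch reusing the question's example, with hypothetical names:

```c
#include <stdio.h>
#include <string.h>

/* The question's "large string" holding ab, cd, bc. position[]/length[]
   are ordered by *word* (ab, bc, cd); the string data needn't be sorted. */
static const char big[] = "abcd";
static const size_t position[] = { 0, 1, 2 };
static const size_t length[]   = { 2, 2, 2 };
static const size_t nwords = 3;

/* Compare needle to the i-th word: memcmp the common prefix,
   then break ties by length (shorter string sorts first). */
static int cmp_word(size_t i, const char *needle) {
    size_t nlen = strlen(needle), wlen = length[i];
    size_t min = nlen < wlen ? nlen : wlen;
    int c = memcmp(needle, big + position[i], min);
    if (c != 0) return c;
    return (nlen > wlen) - (nlen < wlen);
}

/* Standard binary search over word indices 0..nwords-1. */
static long find_word(const char *needle) {
    size_t lo = 0, hi = nwords;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = cmp_word(mid, needle);
        if (c == 0) return (long)mid;
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return -1;
}

int main(void) {
    printf("bc -> %ld\n", find_word("bc"));  /* 1 */
    printf("ad -> %ld\n", find_word("ad"));  /* -1: not in dictionary */
    return 0;
}
```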

Less complex but still efficient is a Trie (or Radix Trie), where common prefixes are merged, but common substrings later in words don't converge again.

If you don't need to modify your DAWG or Trie at all, you can store it efficiently in a single block of memory, rather than dynamically allocating each node. You didn't say why you didn't want to use a Trie, and also didn't acknowledge the existence of the other data structures that do this job better than a plain Trie.
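As a sketch of that single-block idea (the field names are made up): nodes can live in one flat array with edges stored as indices rather than pointers, so the whole structure can be built offline and embedded or loaded as a single blob:

```c
#include <stdint.h>
#include <stdbool.h>

/* Flattened trie/DAWG node: edges are indices into one nodes[] array
   (0 = no edge; node 0 is reserved as the root). No per-node allocation,
   and the array can be memory-mapped or compiled in as static data. */
struct flat_node {
    uint32_t edge[26];  /* index of child for each letter, 0 if none */
    uint8_t  is_word;   /* nonzero if a word ends at this node */
};

static bool flat_contains(const struct flat_node *nodes, const char *s) {
    uint32_t cur = 0;  /* root */
    for (; *s != '\0'; s++) {
        if (*s < 'a' || *s > 'z') return false;
        uint32_t next = nodes[cur].edge[*s - 'a'];
        if (next == 0) return false;
        cur = next;
    }
    return nodes[cur].is_word != 0;
}
```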

  • Did you read the OP's question? Specifically: `For some reason, I will use array rather than Trie`. He made it clear what method he wants to use, and you don't answer the question: `how to generate the "large string"` – fjardon Nov 16 '15 at 09:57
  • @fjardon: Oops, no, I missed that while skimming. I was just updating my answer to slightly better answer the question, or at least point out that what he wants is similar to what lossless compression schemes like gzip do (and suggest using that). Maybe look at those for inspiration. You're right that this answer doesn't directly answer the question the OP asked, but instead tries to suggest something that would solve the same problem a different way. – Peter Cordes Nov 16 '15 at 10:02