
I use the trie data structure to store words. Now I have a new requirement: given a paragraph, find whether certain phrases are present in that paragraph.

What would be the most efficient way to do this? The total number of phrases will not be more than 100.

Umashankar Das
  • *Which* tries are you using? There are multiple subtypes of *trie*, each having different ideal use cases. You'd need to describe your tries more precisely for us to determine whether yours are the most suitable for your purpose. – autistic Aug 20 '15 at 04:27
  • Actually, scrap that... In the real world, "efficiency" is not so black and white. That which is *the most efficient* on your computer might not be on mine. You're better off solving the problem without *runtime efficiency* in mind so much, and focusing on *producing maintainable and non-repetitive code*. Once you have something running, *decide whether it's fast enough*... It doesn't matter whether it's *the most efficient* or not; if it's *fast enough*, then it's *fast enough*. If it's not *fast enough*, use your profiler to determine what the most significant bottleneck is. – autistic Aug 20 '15 at 04:31
  • I'm using a basic implementation of tries. Ref: http://www.geeksforgeeks.org/trie-insert-and-search/ The issue is, this will go into production. And this piece will be called lots of times. So, fast enough is very important for my use case. I'm just looking for an implementation which can take a string which has spaces in it. – Umashankar Das Aug 20 '15 at 04:31
  • If you have a question about profilers, then feel free to ask... As it is, though, not only does this seem *too broad* (hence the close reason) but it also seems unlikely to be helpful to anyone else in the future (another possible close reason). – autistic Aug 20 '15 at 04:33
  • If it turns out that your implementation of tries isn't fast enough (keeping in mind the number of phrases will not be more than 100, which is your real limitation and has nothing to do with production), your profiler will highlight that, and if your code is *maintainable* you should be able to drop any other data structure in to replace it. – autistic Aug 20 '15 at 04:36
  • Should I just assume that you wanna talk about profilers and maintainable code, and hence you wanna have this closed? – Umashankar Das Aug 20 '15 at 04:38
  • @Freenode-newbostonSebivor : it is because of this way of thinking that we have Wirth's law. – v.oddou Aug 20 '15 at 05:25
  • @v.oddou Not entirely. It's because of an incomplete "this way of thinking" that we have Wirth's law. If the understanding is complete, then you'll realise... Software can be profiled and optimised at any time, and if it's maintainable then that makes optimisation simple. – autistic Aug 20 '15 at 05:29
  • @v.oddou Which way *should* we think to avoid Wirth's law? – autistic Aug 20 '15 at 05:29
  • @Freenode-newbostonSebivor I fundamentally agree with your principle of stressing design simplicity, but I feel this guideline (along with Knuth's statement) is taken too close to heart by the industry, and care for speed has gone down in priority. I feel sad that we make our machines execute millions of unneeded cycles just because we were lazy (= didn't have the budget). It seems to me that profiling won't help with data structure choices. "Fast enough" may not scale well in the future. Look at what Avaaz is going through right now. – v.oddou Aug 20 '15 at 05:44
  • @v.oddou So are you suggesting that with every statement of C code that we write we should be concerned by the CPU instructions it is translated into? Again, "Which way *should* we think to avoid [the consequences of] Wirth's law?"... – autistic Aug 20 '15 at 05:57
  • @Freenode-newbostonSebivor I am not talking about statements; I am talking about software engineering, design and architecture. Speed is purely a matter of managing to remove inefficiencies (due to doing unnecessary things). This is why we call the process "optimizing": it means getting closer to absolute perfection (optimality), the minimal number of steps to get something done. E.g. in logistics it means avoiding stopping and restarting a conveyor belt at each stamp on a production line. You need to think, right from the start and at the highest level, about how to arrange your code, not at the line level. – v.oddou Aug 20 '15 at 06:05
  • @v.oddou There is no "absolute perfection". The very reason that profilers exist is because it would be unfeasible to engineer to reduce every bottleneck. Think about how our computers are engineered for more insight. We have a CPU which has memory built into it; L1, L2, L3 and registers... Then we have another level of memory, which is most commonly RAM, and we have yet another memory which is our secondary memory; anything behind a drive controller is secondary memory. Our computers swap between these all the time. The amounts of memory vary from system to system, thus, "perfection" varies. – autistic Aug 20 '15 at 08:12
  • @v.oddou Perhaps on a system that has a flat hierarchy of memory, using memory that is all the same speed or even capacity, what you wrote might make sense. Such a system would be suboptimal in terms of performance or cost. There's a reason computers have evolved the way they have... – autistic Aug 20 '15 at 08:25

2 Answers


If I were you, I would just throw something together using boost::multi_index_container first, because then if you get even more requirements later it will be quite easy to extend it further. If later you measure and find that it is not performing adequately, then you can replace it with an optimized data structure.
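
A minimal sketch of what that could look like (the index choice and the helper function here are my own illustration, not code from this answer): store the phrases in the container, then check a paragraph by iterating over them, which is perfectly reasonable for at most 100 phrases.

```cpp
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <iostream>
#include <string>

namespace bmi = boost::multi_index;

// Index #0 is the phrase text itself; further indexes (length, category,
// insertion order, ...) can be bolted on later without touching callers.
using PhraseSet = bmi::multi_index_container<
    std::string,
    bmi::indexed_by<
        bmi::ordered_unique<bmi::identity<std::string>>
    >
>;

// True if any stored phrase occurs verbatim in the paragraph.
bool any_phrase_present(const PhraseSet& phrases, const std::string& paragraph)
{
    for (const std::string& phrase : phrases) {
        if (paragraph.find(phrase) != std::string::npos)
            return true;
    }
    return false;
}

int main()
{
    PhraseSet phrases;
    phrases.insert("data structure");
    phrases.insert("hash table");

    std::cout << any_phrase_present(phrases, "A trie is a data structure.") << '\n';  // prints 1
}
```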

Chris Beck
  • I should've expressed this in my own answer (rather than the comments for this question)! Good answer, +1 from me :) – autistic Aug 20 '15 at 05:18

The trie implementation you linked to is suboptimal in numerous ways.

  • For a start, it constructs multiple nodes per item inserted. As the author writes, "Every character of input key is inserted as an individual trie node." That's a horrible and unnecessary penalty! The use of an ALPHABET_SIZE greater than 2 adds insult to injury here; not only would a phrase of fifty bytes require fifty nodes, but each node would likely be over one hundred bytes in size... Each item or "phrase" of fifty bytes in length might require up to about 5KB of storage using that code (see the sketch after this list)! That's not even the worst of it.
  • The algorithm provided embeds malloc internally, making it quite difficult to optimise. Each node is its own allocation, making insert very malloc-heavy. Allocation details should be separated from data structure processing, if not for the purpose of optimisation then for simplicity of use. Programs that use this code heavily are likely to run into performance issues related to memory fragmentation and/or cache misses, with no easy or significant optimisation in sight except for replacing the trie with something else.
  • That's not the only thing wrong here... This code isn't too portable, either! If you end up running this on an old (not that old; they do still exist!) mainframe that uses EBCDIC rather than ASCII, this code will produce buffer overflows, and the programmer (you) will be called in to fix it. <sarcasm>That's so optimal, right?</sarcasm>
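
For a sense of the arithmetic in the first bullet, here is a rough sketch of the node layout that style of trie uses (not the article's exact code; the field names are illustrative):

```cpp
#include <cstdio>

constexpr int ALPHABET_SIZE = 26;  // one child pointer per letter, as in the linked article

struct TrieNode {
    TrieNode* children[ALPHABET_SIZE];  // 26 * 8 = 208 bytes with 64-bit pointers
    bool      isEndOfWord;              // plus padding
};

int main()
{
    // Roughly 216 bytes per node on a 64-bit build, or about 108 bytes with
    // 32-bit pointers -- which is where "over one hundred bytes" per node and
    // ~5KB per fifty-character phrase come from.
    std::printf("sizeof(TrieNode) = %zu bytes\n", sizeof(TrieNode));
}
```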

I've written a PATRICIA trie implementation that uses exactly one node per item, an alphabet size of two (it uses the bits of each character, rather than each character as a whole) and allows you to use whichever allocation strategy you wish... alas, I haven't yet put a lot of effort into refactoring its interface, but it should be fairly close to optimal. You can find that implementation here. You can see examples of inserting (using patricia_add), retrieving (using patricia_get) and removing (using patricia_remove) in the patricia_test.c test case file.
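
To illustrate the "alphabet size of two" idea in general terms (this is only a sketch of the kind of bit test and node a PATRICIA/crit-bit trie uses, not the linked library's actual API or node layout):

```cpp
#include <climits>
#include <cstddef>

// Read bit `i` of a key, treating the key as a flat sequence of bits.
// A binary trie branches on one such bit per node, so it needs only two
// child links regardless of which character set the key bytes come from
// (ASCII, EBCDIC, UTF-8, ...).
inline int key_bit(const unsigned char* key, std::size_t i)
{
    return (key[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}

struct BitTrieNode {
    std::size_t  bit_index;  // which bit of the key this node tests
    BitTrieNode* child[2];   // "alphabet size of two": exactly two edges
    const char*  item;       // PATRICIA keeps exactly one node per stored item
};
```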

autistic
  • I give you +1 also; you clearly have quite some in-depth knowledge about trie structures – Chris Beck Aug 20 '15 at 09:21
  • Interesting implementation. Is this implementation character-agnostic? I.e., can it handle characters which require escape sequences, as well as spaces, newlines, etc.? I know which system I will run on, so EBCDIC is not an issue. My only concern was that if I increase the alphabet size from 26 to, say, 100 to support other characters, memory could become a problem. – Umashankar Das Aug 20 '15 at 20:50
  • @UmashankarDas An increase in *your* alphabet size does not correspond with an increase in the computer's alphabet size; computers always operate in terms of binary... Remember this. Two edges per node is all you need. – autistic Aug 20 '15 at 21:54
  • You obviously are more of an academic and less of a production system guy. Sometimes it helps to talk more in practical terms rather than spouting best practices. Thanks for your time. – Umashankar Das Aug 21 '15 at 13:27
  • @UmashankarDas Did you notice that one of my most popular answers is trie-related? So I know a lot about tries... Does that mean I'm not practical with my programming? Is that all you're basing this judgement on? – autistic Aug 21 '15 at 15:15
  • @UmashankarDas A trie that can't handle "characters which require escape sequences and spaces and newlines etc" doesn't deserve to be used. Yes. If you want, you can even use integers or structs (keeping in mind non-deterministic padding bytes might cause problems) as keys for my trie. Just make sure you tell it how long your keys are. – autistic Aug 21 '15 at 15:18