
I have a list of recipes obtained from a database that looks like this:

List<RecipeNode> _recipeList;

RecipeNode, among other things, has a property that references one or more tags (such as Dinner, Breakfast, Side, Vegetarian, Holiday, and about 60 others).

   public sealed class RecipeNode
   {
      public Guid RecipeId;
      public Byte[] Tags; //Tags such as 1, 5, 6, 8, 43
      //... More stuff
   }

Finding a random recipe from _recipeList in O(1) would of course be easy; however, what I need to do is find a random recipe that has, say, a 5 in its Tags, still in O(1).

Right now, my only idea is to make an array of List<RecipeNode>, keyed by tag. For example:

List<RecipeNode>[] _recipeListByTag;

Then, _recipeListByTag[5] would contain a list of all the recipes that have a 5 in the Tags array. I could then choose a random allowed tag and a random recipe within that tag in O(1).
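
To make that concrete, here's a sketch of what I mean (names like TagIndex and MaxTag are just for illustration; RecipeNode is the class above):

    using System;
    using System.Collections.Generic;

    static class TagIndex
    {
        const int MaxTag = 64; // I have about 60 tags

        public static List<RecipeNode>[] Build(List<RecipeNode> recipes)
        {
            var byTag = new List<RecipeNode>[MaxTag];
            for (int t = 0; t < MaxTag; t++)
                byTag[t] = new List<RecipeNode>();

            // Each recipe is added once per tag it carries, so the total number
            // of stored references equals the number of recipe-tag pairings.
            foreach (var recipe in recipes)
                foreach (byte tag in recipe.Tags)
                    byTag[tag].Add(recipe);

            return byTag;
        }

        // O(1): one random draw plus one array indexing operation.
        public static RecipeNode PickRandom(List<RecipeNode>[] byTag, byte tag, Random rng)
        {
            var bucket = byTag[tag];
            return bucket[rng.Next(bucket.Count)];
        }
    }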

The drawback of this approach is that the total size of this array of lists would be the number of recipe-tag pairings (i.e., the sum of Tags.Length across all recipes), which starts to take up a lot of memory since I'm storing a potentially huge number of recipes in this array. Of course, since RecipeNode is a reference type, I'm only repeating the 4-byte references to the recipes, so this still might be the best way to go.

Is there a more efficient data structure or algorithm I could use to allow me to find a random recipe that contains a certain allowed tag? Thanks!

Mike Christensen
  • Why not simply use `SQLite` and retrieve the data in constant (or as close to constant as possible) time? – Tigran Jan 28 '12 at 21:13
  • +1 Tigran - looks like a problem suited to a DB. – Oded Jan 28 '12 at 21:14
  • It seems to me very odd that you are using a byte array for the tags, instead of something actually object-oriented, such as a list of references to tag objects. Do you really have such memory constraints that you actually have to do hacks like this? – Mike Nakis Jan 28 '12 at 21:16
  • "repeating the 4byte pointers" that depends on the implementation. How many items do you have? with not huge number of items O(log(n)) is almost O(1) – Lukasz Madon Jan 28 '12 at 21:20
  • Your current implementation is not exactly O(1). Whenever you want a random entry with a particular tag, you need to translate the tag into the corresponding integer index. You can use hash tables for that, but that is not O(1) either; it is O(log n). The difference is not exactly noticeable, but still.... There are other, cleaner ways to achieve O(log n), but I am unaware of any O(1) approach. – ElKamina Jan 28 '12 at 21:22
  • @ElKamina, his approach is O(1). He is pre-building a vector of the recipes with each tag. The size of each vector is known. To pick a random element, he just needs to generate a random number from 0 to the vector length less one, and do an array indexing operation to retrieve the element at that index. Indexing into an array is O(1). – Alex D Jan 28 '12 at 21:33
  • @AlexD My question is, given a tag (eg. Holiday) how does he find the Byte tag ? That is not possible in O(1) time. – ElKamina Jan 28 '12 at 21:38
  • @ElKamina: That might be O(k) or O(lg k), but it is not O(lg N). – Ben Voigt Jan 28 '12 at 21:40
  • Presumably you need to pick many recipes (more than `Recipes * Tags` if you are willing to consider that initialization to amortize into O(1) with the selection). Maybe if you described the picking operation we could suggest a way to get the entire initialization + all picking to amortize to O(1) per pick. – Ben Jackson Jan 28 '12 at 21:43
  • @ElKamina, a hashtable is not a sorted list; it is exactly O(1). – L.B Jan 28 '12 at 21:49
  • Classic case of dismissing the domain. No way that the world has enough recipes to give a modern machine any kind of challenge solving this in human time. – Hans Passant Jan 28 '12 at 21:52
  • @Tigran - Unfortunately, any DB access is too slow for this algorithm. It needs to run instantly. Plus, I'm pretty sure a DB wouldn't find the data in O(1) either. – Mike Christensen Jan 28 '12 at 21:52
  • @ElKamina - Determining if a `RecipeNode` *has* a certain tag is not necessary if I index my arrays by tag. So nowhere do I need to query if recipe X has Tag y in O(1). – Mike Christensen Jan 28 '12 at 22:07
  • @BenJackson - Very perceptive of you :) This "picking operation" needs to run several hundred thousand times a second. The algorithm is not designed for finding a single recipe, but building optimal sets of recipes that meet desired criteria. Think of it as a multiple knapsack problem: "I have 12 eggs, 1lb of salmon and a head of lettuce - What 5 recipes can I make this week that will use that all up?" – Mike Christensen Jan 28 '12 at 22:14
  • @Tigran A relational database is not a magic wand. It cannot do things more efficiently just because of what it is. In fact, random selection from a relational database tends to be very inefficient unless you take special measures. – Nick Johnson Jan 28 '12 at 22:51
  • @NickJohnson: Using the term `random` as-is, you can **never** **guarantee** `O(1)` access, precisely because it's random. The OP wants O(1) access (where possible) and is concerned about memory consumption, and that request fits quite well into the relational DB concept. Yes, it is not a magic wand: it is a suitable (imo the best) solution for this problem. – Tigran Jan 28 '12 at 23:09
  • I'm pretty sure no DB on the planet is going to do `select recipeid from Recipes ORDER BY RANDOM() LIMIT 1;` in O(1). Not to mention the overhead of making any SQL call in the first place is not O(1) either. Whereas `RecipeNode r = _recipeList[random.Next(_recipeList.Count)];` will be O(1) every time. – Mike Christensen Jan 28 '12 at 23:24
  • @MikeChristensen: then maybe I misunderstood your idea of randomness. I understood that what is random (in your case) is not the *access* but the *parameter* you query on in order to find the list of suitable objects. One time you might query for recipes with Tag=5, another time for recipes that have tag 12, and so on. The example you provide in the comment is *naturally* `O(1)` access (no doubt), but it doesn't seem to me to be what the question asks, since there you pick an index rather than a search parameter. Maybe I'm missing something... – Tigran Jan 28 '12 at 23:35
  • @Tigran - Simply put, I want to find a random recipe that has tag X in O(1), and this operation is so frequent that doing a DB call each time is not a viable option. – Mike Christensen Jan 28 '12 at 23:50

5 Answers


List<RecipeNode>[] _recipeListByTag is probably the best approach for you, and its size is not Recipes * Tags, because each list in the array will only contain the recipes that match that tag, and no more. Its size would only become Recipes * Tags if every single recipe contained every single tag.

If the amount of memory occupied by your data structures is so very dear to you, do not forget to call List.TrimExcess() after you have populated each list.
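
For example (a sketch, assuming the index from the question has already been fully populated):

    // Trim each per-tag list so its backing array holds no spare capacity.
    foreach (var bucket in _recipeListByTag)
        if (bucket != null)
            bucket.TrimExcess();

Note that TrimExcess is a no-op when a list is already at more than 90% of its capacity, so it only pays off for lists with substantial slack.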

Mike Nakis
  • My apologies for not wording that right - What I meant is that if a recipe contained Tag 5 and Tag 10, a reference to it would be *repeated* in index 5 and 10. The length would be the number of relationships between a recipe and a tag. – Mike Christensen Jan 28 '12 at 21:42
  • I've decided to go this route, and have an array for each tag. Even with a million pairings, what's another 4 megs on the heap right? – Mike Christensen Jan 28 '12 at 22:55
  • I believe so. But 4 megs are nothing nowadays. – Mike Nakis Jan 29 '12 at 00:13

Is this homework? I doubt any real-world recipe program would require O(1) access to tags yet be too slow to use a database. I also doubt any real-world recipe would have numeric tags. Understanding the real domain can help provide a better answer.

However, if it really is about recipes and numeric tags, and if you only have 256 tags, why don't you just choose a random recipe repeatedly, up to 1 million times, until one carries the required tag? The odds of never hitting such a recipe are minimal, and the complexity is still O(1). If you don't like the odds, allow 10^20 attempts; the complexity is still O(1).
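
A sketch of what I mean (the attempt cap and the names are illustrative; RecipeNode is the class from the question):

    using System;
    using System.Collections.Generic;

    static class RejectionPicker
    {
        public static RecipeNode Pick(List<RecipeNode> recipes, byte tag, Random rng)
        {
            const int maxAttempts = 1000000; // the "1 million times" above
            for (int i = 0; i < maxAttempts; i++)
            {
                RecipeNode candidate = recipes[rng.Next(recipes.Count)];
                if (Array.IndexOf(candidate.Tags, tag) >= 0) // short linear scan of its tags
                    return candidate;
            }
            return null; // tag is absent or vanishingly rare
        }
    }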

UPDATE:

Since it's not the O(1) you're worried about, but rather the time it takes to pick a random recipe, I suggest you let your database handle this for you - the same database that holds all the recipes anyway, and the same database you're going to access to show the random recipe.

You can SELECT a random record in SQL Server this way: SQL Server Random Sort. If you're using some other database, there are other ways: http://www.petefreitag.com/item/466.cfm. Just make sure your WHERE clause has Tag=17 in it.
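
For illustration, the query would look something like this; the normalized RecipeTags(RecipeId, Tag) table is my assumption about the schema, and RANDOM() is SQLite/PostgreSQL syntax (SQL Server would use ORDER BY NEWID() instead):

    // Illustrative query string; the RecipeTags join table is assumed, not the OP's schema.
    const string randomRecipeWithTag = @"
        SELECT r.RecipeId
        FROM Recipes r
        JOIN RecipeTags rt ON rt.RecipeId = r.RecipeId
        WHERE rt.Tag = 17
        ORDER BY RANDOM()
        LIMIT 1;";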

zmbq
  • Note: I agree you're still not absolutely sure this technique will yield a recipe with the appropriate tag, but it will in any real-world case... – zmbq Jan 28 '12 at 22:08
  • Randomly choosing a recipe and then throwing it away if it doesn't match any of the desired tags will slow down the function quite a bit, especially if the user only allows 1 obscure tag (like African dishes). It's better to create an index of recipes by tag, then look up valid recipes in O(1). I'm just wondering on the best way to do that with minimal memory usage. – Mike Christensen Jan 28 '12 at 22:18
  • I thought you were worried about complexity, not speed. If it's speed you're worried about, put the whole thing in a database and let it sort it out for you. It'll be fast enough. – zmbq Jan 28 '12 at 22:21
  • Man that would be the mother of all SQL queries :) – Mike Christensen Jan 28 '12 at 22:22
  • What? No, not the 10^20 random selections, just select the IDs of the recipes with the tag, and choose randomly from that list. You're not storing your recipes in memory the whole time, right? They're in a database *anyway*. – zmbq Jan 28 '12 at 22:23
  • I *am* storing my recipes in memory the whole time. This particular function is part of a larger algorithm that cannot be solved using a database query. The function that picks one of these random recipes needs to run several hundred thousand times a second - I'm sure not doing several hundred thousand SQL queries a second. – Mike Christensen Jan 28 '12 at 22:27
  • It's not about finding one recipe - It's about finding a set of recipes that match your search. Say I had 1,000 recipes (I have way more), and wanted to go over every set of 5. That would be somewhere around 8.6 trillion combinations to evaluate - That might take a while :) – Mike Christensen Jan 28 '12 at 22:33
  • I think you should restate your question, and ask about what you *really* want to do. Explain the situation better - what the data looks like, what search the user makes and what the user expects in return. Even if you're trying a knapsack problem (find me recipes for a week with a total of 4,000 calories) there are still easier ways to do that. – zmbq Jan 28 '12 at 22:37
  • I stated the question perfectly. I laid out the data structures and restrictions, and what I wanted to solve. I asked the exact question I wanted an answer to, and it's not my fault people choose to travel outside the boundaries of those parameters. You'll just have to trust I know the problem domain a lot better than you, and if the problem could be better solved by a few SQL queries I would have done that. With that said, `+1` for your answer as I appreciate your willingness to help. – Mike Christensen Jan 28 '12 at 22:54

If you want to keep the data in memory, you won't do much better than a list of (4-byte) pointers for each tag. If you can use a DB... well, others have already posted about that. Depending on how huge "huge" is, you might just fork out some $$$ to add RAM to the target machine.

If you do want to keep the data in memory, but want to be ridiculously parsimonious with it, you could try to squeeze down the 4 bytes per tag-recipe combination. For example, keep all the recipes in one big array, and (assuming you won't have more than about a million recipes) store array indexes in 3 bytes each.

To go even further, you could divide the recipes with a given tag into equally-sized "buckets" (each occupying a contiguous area of the big array), store a starting index for each "bucket" (in 3-4 bytes), and then store a list of delta values between indexes of consecutive recipes with the given tag. Encode the delta values using an array of bytes, in such a way that a single delta value can use anything from 1-4 bytes, as required.
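
Here is a sketch of that encoding, using a LEB128-style varint (the low 7 bits of each byte carry data; a set high bit means another byte follows) rather than the UTF-8-style prefix; DeltaCodec and the method names are just for illustration:

    using System;
    using System.Collections.Generic;

    static class DeltaCodec
    {
        // Encode a sorted list of array indexes as deltas; small gaps cost 1 byte each.
        public static byte[] Encode(int[] sortedIndexes)
        {
            var output = new List<byte>();
            int previous = 0;
            foreach (int index in sortedIndexes)
            {
                uint delta = (uint)(index - previous);
                previous = index;
                while (delta >= 0x80)
                {
                    output.Add((byte)(delta | 0x80)); // high bit: more bytes follow
                    delta >>= 7;
                }
                output.Add((byte)delta);
            }
            return output.ToArray();
        }

        // Decode one "bucket"; O(bucket size), which is O(1) for constant-size buckets.
        public static List<int> Decode(byte[] encoded)
        {
            var indexes = new List<int>();
            int previous = 0, pos = 0;
            while (pos < encoded.Length)
            {
                uint delta = 0;
                int shift = 0;
                byte b;
                do
                {
                    b = encoded[pos++];
                    delta |= (uint)(b & 0x7F) << shift;
                    shift += 7;
                } while ((b & 0x80) != 0);
                previous += (int)delta;
                indexes.Add(previous);
            }
            return indexes;
        }
    }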

Because the number of recipes in a "bucket" will be limited to a constant number, retrieval using this approach is still O(1).

(I have done embedded programming on micros with as little as 256 bytes of RAM... when you do that you start thinking of very creative ways to save bytes or even bits. I'm sure that going to such lengths will not be necessary in your application, but I thought this was an interesting idea.)

Alex D
  • I was trying to go down these lines too, but I came to the conclusion that (memory usage wise) it wouldn't be any different than my multi-dimensional array idea. For example, if Tag 5 recipes were stored between array index 100 and 200, and Tag 6 recipes were stored between 201 and 230, a recipe that had both tags 5 *and* 6 would still need to be stored twice. – Mike Christensen Jan 28 '12 at 21:59
  • @MikeChristensen, I think you may not have completely understood my idea. With the "multi-dimensional arrays", you need a 4-byte pointer for each recipe-tag combination. This idea is a different (delta-encoded) representation of the same structure, designed to push the number of bytes per recipe-tag combo down to perhaps 1-2 bytes on average. If the difference between the indexes of the 1st and 2nd recipes with tag 5 is less than 255, you could store that difference in 1 byte, rather than storing two 4-byte indexes. If it was less than 2^16, you could store it in 2 bytes, etc. – Alex D Jan 29 '12 at 10:43
  • In order to use different numbers of bytes for the delta values, you could use a scheme similar to UTF-8. (At the beginning of each multi-byte character, they use certain bits to indicate how many bytes this character occupies.) – Alex D Jan 29 '12 at 10:44

I would export, from the source list into another list, references to all the elements that suit you. Then make a random choice in that derived list and take the corresponding element from the source list via the reference. If there is a chance you could reuse the same derived list, keep such lists in a larger list of them. (Of course, the best algorithm to choose depends on the real usage statistics of your list.)

If you use only one parameter, you could order your list by this parameter and record, in another list B, references to where the elements with each parameter value start. Later you could simply take a random index in the interval (B[4]; B[5]-1). This would make a random choice run in O(const).
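
A sketch of that layout (names are illustrative; `starts` plays the role of B and carries one extra sentinel entry at the end):

    using System;

    static class SortedPicker
    {
        // Recipes are sorted by their single tag value; the recipes with tag t
        // occupy sortedByTag[starts[t] .. starts[t + 1] - 1].
        public static RecipeNode PickWithTag(RecipeNode[] sortedByTag, int[] starts,
                                             int tag, Random rng)
        {
            int begin = starts[tag];
            int end = starts[tag + 1]; // one past the last recipe with this tag
            return sortedByTag[begin + rng.Next(end - begin)];
        }
    }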

Gangnus
  • If I'm understanding this correctly, it seems similar to what @AlexD was saying. Storing the offsets where each group starts and ends. I think this is the same as a jagged array. – Mike Christensen Jan 28 '12 at 22:02
  • Yes, it is equivalent. But the main thought is that it is impossible to propose the optimal algorithm without knowing the expected usage statistics. – Gangnus Jan 28 '12 at 23:52
  • If you use integer tags with few possible values, you needn't even order by tags. Better to take the max and min and make a list of recipes for every position between the two. This would be a sparse jagged array. – Gangnus Jan 28 '12 at 23:56

In this case I would personally go for an SQLite solution (as I personally know it better than the others). I see that you worry about space, and about performance not in terms of raw speed but in terms of constant retrieval time; you worry about flexibility of data access too. Imo, in this case, SQLite is the way to go.

Design your small DB the way you like and execute queries and joins the way you want.

This is an old but still valid example of how you can use it. There is also, naturally, an ORM solution (for example, a LINQ driver), but to me personally it seems like overhead.

Hope this helps.

Tigran