5

Write a function which has:

input: an array of N pairs (unique id, weight), with K <= N  
output: K random unique ids (drawn from the input array)  

Note: over many calls, the frequency with which an id appears in the output should be proportional to its weight. Example: an id with weight 5 should appear in the output 5 times more often than an id with weight 1. Also, the amount of memory allocated should be known at compile time, i.e. no additional memory should be allocated.

My question is: how to solve this task?

EDIT
Thanks for the responses, everybody!
Currently I can't understand how the weight of a pair affects the frequency of its appearance in the output. Can you give me a clearer, "for dummies" explanation of how it works?

Ivan
x__dos
  • What is your question here exactly? – javamonkey79 Nov 30 '10 at 20:57
  • Is the order of result ids important? I.e., would [1, 2] and [2, 1] be considered different results? – Nikita Rybak Nov 30 '10 at 20:59
  • 1
    @Sergey, @javamonkey79 Request might not be stated explicitly, but I think it's pretty evident: answer this interview question. – Nikita Rybak Nov 30 '10 at 21:00
  • @Nikita not important, [1, 2] is the same as [2, 1] – x__dos Nov 30 '10 at 21:02
  • 3
    There is a little bit of contradiction going on in the requirement. It states that K <= N. But as K approaches N, the frequency requirement will be contradicted by the Uniqueness requirement. Worst case, if K=N, all elements will be returned (i.e appear with same frequency), irrespective of their weight. – Axn Nov 30 '10 at 21:49
  • @Axn: good catch. I think I disbelieve that "unique" is being used as an adjective describing the output. It's just that the things being selected are (for some reason) called "unique ids". Two "unique ids" can be equal, if they just so happen to be the same "unique id", if you see what I mean. – Steve Jessop Nov 30 '10 at 21:55
  • @Steve That would make a lot of sense, wouldn't it :) Perhaps OP could clarify whether the same data element (let's get rid of the "unique id" term) can appear multiple times in the same run. If so, there would be no need to remove already-selected items, and the logic becomes much simpler – Axn Nov 30 '10 at 22:12
  • 1
    Correction: There are *no* interesting interview questions. – John Dibling Nov 30 '10 at 22:29
  • @kilonet: is [3, 3] a valid output, or must the output ids be distinct? It's unclear what you mean by "output: K random unique Ids" – Laurence Gonsalves Nov 30 '10 at 22:34
  • @Laurence output can't contain any Id twice – x__dos Nov 30 '10 at 22:50
  • @Nikita Rybak, I think this was understood by all. Blatantly asking to solve a problem without even attempting it doesn't work well on SO. I was actually expecting this question to be closed. –  Nov 30 '10 at 23:15

5 Answers

7

Assuming a good enough random number generator:

  • Sum the weights (total_weight)
  • Repeat K times:
    • Pick a random number between 0 and total_weight (selection)
    • Find the first pair where the sum of all the weights from the beginning of the array to that pair is greater than or equal to selection
    • Write the first part of the pair (the id) to the output

You need enough storage to store the total weight.
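
A minimal sketch of this in C++ (names like Pair and pick are mine, not from the answer; note that, as the comments below point out, repeats are possible since nothing removes a chosen id):

#include <cstdlib>

struct Pair { unsigned id; unsigned weight; };

// Sketch of the steps above: K independent weighted picks.
template <unsigned N, unsigned K>
void pick(const Pair (&input)[N], unsigned (&output)[K])
{
    unsigned total_weight = 0;
    for (unsigned i = 0; i < N; ++i) total_weight += input[i].weight;

    for (unsigned k = 0; k < K; ++k)
    {
        const unsigned selection = rand() % total_weight; // in [0, total_weight)
        unsigned running = 0;
        for (unsigned i = 0; i < N; ++i)
        {
            running += input[i].weight;
            if (running > selection)  // running sum has passed the selection
            {
                output[k] = input[i].id;
                break;
            }
        }
    }
}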

MSN
  • I like this one as it uses no additional memory, as stated in the problem. – skajfes Nov 30 '10 at 22:05
  • They want K unique random ids. This algorithm could return K of the same id. – Victor Parmar Nov 30 '10 at 22:05
  • But they also say: "...the frequency with which an id appears in the output should be proportional to its weight". Or what if the array has 3 elements and K is equal to 4? Or what if all but one of the elements have weight 0? So +1 for this answer. – Jan Nov 30 '10 at 22:19
  • Can't see how to ensure the uniqueness of ids in the output. @Jan: K is less than or equal to N. – x__dos Nov 30 '10 at 22:44
  • @MSN, @kilonet: we can build on this though (and very easily too), when you select an id, remove its weight from the total and then set it to 0 in the array (or shuffle the elements to override the pair). – Matthieu M. Dec 01 '10 at 08:05
5

Ok so you are given input as follows:

(3, 7)
(1, 2)
(2, 5)
(4, 1)
(5, 2)

And you want to pick a random number so that the weight of each id is reflected in the picking, i.e. pick a random number from the following list:

3 3 3 3 3 3 3 1 1 2 2 2 2 2 4 5 5

Initially I created a temporary array, but this can be done without one: you can calculate the size of the conceptual list by summing all the weights up = X; in this example X = 17

Pick a random number in [0, X-1], and calculate which id should be returned by looping through the list, doing a cumulative addition on the weights. Say I have the random number 8

(3, 7) total = 7 which is < 8
(1, 2) total = 9 which is >= 8 **boom** 1 is your id!

Now since you need K random unique ids, you can create a hashtable from the initial array passed to you and work with that. Once you find an id, remove it from the hashtable and proceed with the algorithm. Edit: note that you create the hashtable only once! Your algorithm will work on this instead of looping through the array. I did not put this at the top, to keep the answer clear.

As long as your random calculation is not secretly using any extra memory, you will need to store the K random pickings (K <= N) and a copy of the original array, so the maximum space requirement at runtime is O(2*N)

Asymptotic runtime is:

O(n) : create copy of original array into hashtable +
(
   O(n) : calculate sum of weights +
   O(1) : calculate random in range +
   O(n) : cumulative totals
) * K random pickings
= O(n*k) overall
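
Here is a rough sketch of this approach in C++ (the function name and signature are illustrative only), using std::unordered_map as the hashtable:

#include <cstdlib>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: copy the input into a hashtable, then repeat the weighted pick,
// erasing each chosen id so it cannot be returned twice.
std::vector<unsigned> pickUnique(
    const std::vector<std::pair<unsigned, unsigned>>& input, // (id, weight)
    unsigned k)
{
    std::unordered_map<unsigned, unsigned> pool(input.begin(), input.end());
    std::vector<unsigned> out;

    while (out.size() < k && !pool.empty())
    {
        unsigned total = 0;                 // O(n): sum of the remaining weights
        for (const auto& e : pool) total += e.second;

        const unsigned r = rand() % total;  // O(1): random in [0, total)

        unsigned running = 0;               // O(n): cumulative walk
        for (const auto& e : pool)
        {
            running += e.second;
            if (running > r)                // **boom** -- e.first is the id
            {
                out.push_back(e.first);
                pool.erase(e.first);
                break;
            }
        }
    }
    return out;
}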

This is a good question :)

Victor Parmar
  • 1
    @Vic doesn't "create copy of original array into hashtable" break the requirement "amount of memory allocated should be known at compile time, i.e. no additional memory should be allocated"? – Conrad Frix Nov 30 '10 at 21:43
  • 2
    It's O(NK), and you're allocating memory for the hash table which OP says isn't allowed (I presume he means O(1) space). – moinudin Nov 30 '10 at 21:43
  • Well, they want K numbers returned; where are you going to put them? I think the idea is that they want a constant maximum amount of memory that the function can allocate, and it won't grow beyond the hashmap + the array to store the ids, which should be O(2*N). – Victor Parmar Nov 30 '10 at 21:50
  • @vic O(1) space does not restrict you from allocating anything. Your hashmap grows with the size of the input though. And you typically exclude input and output data in space complexity analysis. – moinudin Nov 30 '10 at 21:54
  • @marcog fair enough, I was working on the assumption that if I guarantee that I don't allocate more than N ids into the hashmap and if I know how the map is constructed, that would guarantee the memory requirements. Am I wrong in this? – Victor Parmar Nov 30 '10 at 21:57
  • @vic Getting very technical here (and only because this is an interview question): Hashmap is going to consume more memory than an array of K numbers, so technically you're wrong. Not that it matters 99% of the time in the real world. – moinudin Nov 30 '10 at 22:00
  • In the worst case you can modify the original array, but that would make it O(n*n) runtime :( – Victor Parmar Nov 30 '10 at 22:01
  • @marcog agreed, but is the amount of extra memory required a constant factor of the number of items added? I think it depends on the implementation but that should be the case for most library implementations no? – Victor Parmar Nov 30 '10 at 22:03
  • But you have to agree this is a great interview question =) – Victor Parmar Nov 30 '10 at 22:04
    @vic Depends on hash table implementation, but it will usually be roughly a constant factor. What O(1) needs is a constant overhead, not a constant factor though. It's a good interview question, although I've come to dislike (from an interviewing perspective) heavily lightbulb-moment questions like this (even though I do very well at them). – moinudin Nov 30 '10 at 22:12
  • This is at least O(K*N), not O(N) overall. You're doing things that are O(N) K times. – Laurence Gonsalves Nov 30 '10 at 22:56
3

This solution works with non-integer weights and uses constant space (i.e. space complexity = O(1)). It does, however, modify the input array, but the only difference in the end is that the elements will be in a different order.

  • Add the weight of each input to the weight of the following input, working from the first element to the last. Now each weight is actually the sum of that input's weight and all of the previous weights (a running prefix sum).

  • sum_weights = the sum of all of the weights, and n = N.

  • K times:

    • Choose a random number r in the range [0,sum_weights)

    • binary search the first n elements for the first slot where the (now summed) weight is greater than or equal to r; call this index i.

    • Add input[i].id to output.

    • Subtract input[i-1].weight from input[i].weight (unless i == 0); this restores the selected element's original weight. Now subtract input[i].weight from the following (> i) input weights and also from sum_weights.

    • Move input[i] to position [n-1] (sliding the intervening elements down one slot). This is the expensive part, as it's O(N) and we do it K times. You can skip this step on the last iteration.

    • subtract 1 from n

  • Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight

Time complexity is O(K*N). The expensive part (of the time complexity) is shuffling the chosen elements. I suspect there's a clever way to avoid that, but haven't thought of anything yet.
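
A rough C++ sketch of the above, assuming positive integer weights (the names are mine; the search is a linear scan here for brevity, where a binary search would do as well, and the O(N) slide dominates either way):

#include <cstdlib>

struct Pair { unsigned id; unsigned weight; };

// In-place sketch: weights become prefix sums, each selected element is slid
// past the active region, and the weights are fixed back at the end.
template <unsigned N, unsigned K>
void sampleUnique(Pair (&input)[N], unsigned (&output)[K])
{
    for (unsigned i = 1; i < N; ++i)             // weights -> prefix sums
        input[i].weight += input[i - 1].weight;

    unsigned sum_weights = input[N - 1].weight;
    unsigned n = N;

    for (unsigned k = 0; k < K; ++k)
    {
        const unsigned r = rand() % sum_weights; // r in [0, sum_weights)

        unsigned i = 0;                          // first prefix sum > r
        while (input[i].weight <= r) ++i;

        output[k] = input[i].id;

        // Recover the selected element's original weight.
        const unsigned w = input[i].weight - (i ? input[i - 1].weight : 0);

        for (unsigned j = i + 1; j < n; ++j)     // remove it from the later sums
            input[j].weight -= w;
        sum_weights -= w;

        const Pair selected = { input[i].id, w }; // slide it past the active region
        for (unsigned j = i; j + 1 < n; ++j)
            input[j] = input[j + 1];
        input[n - 1] = selected;
        --n;
    }

    for (unsigned i = n; i-- > 1; )              // fix back the summed region
        input[i].weight -= input[i - 1].weight;
}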

Update

It's unclear what the question means by "output: K random unique Ids". The solution above assumes that this meant that the output ids are supposed to be unique/distinct, but if that's not the case then the problem is even simpler:

  • Add the weight of each input to the weight of the following input, working from the first element to the last. Now each weight is actually the sum of that input's weight and all of the previous weights (a running prefix sum).

  • sum_weights = the sum of all of the weights, and n = N.

  • K times:

    • Choose a random number r in the range [0,sum_weights)

    • binary search the first n elements for the first slot where the (now summed) weight is greater than or equal to r; call this index i.

    • Add input[i].id to output.

  • Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight

Time complexity is O(N + K*log(N)): O(N) to build the prefix sums, then O(log(N)) per pick.
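
A minimal sketch of this simpler variant, again with assumed names and positive integer weights; std::upper_bound does the binary search over the prefix sums:

#include <algorithm>
#include <cstdlib>

struct Pair { unsigned id; unsigned weight; };

// Prefix-sum once, binary-search K times, then restore the weights.
template <unsigned N, unsigned K>
void sample(Pair (&input)[N], unsigned (&output)[K])
{
    for (unsigned i = 1; i < N; ++i)             // weights -> prefix sums
        input[i].weight += input[i - 1].weight;
    const unsigned sum_weights = input[N - 1].weight;

    for (unsigned k = 0; k < K; ++k)
    {
        const unsigned r = rand() % sum_weights; // r in [0, sum_weights)
        const Pair* p = std::upper_bound(        // first prefix sum > r
            input, input + N, r,
            [](unsigned v, const Pair& e) { return v < e.weight; });
        output[k] = p->id;
    }

    for (unsigned i = N - 1; i >= 1; --i)        // fix back original weights
        input[i].weight -= input[i - 1].weight;
}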

Laurence Gonsalves
  • 1
    O(N) space complexity isn't allowed. – Steve Jessop Nov 30 '10 at 21:51
  • If the random number is truly random, you don't need to adjust the weights. – MSN Nov 30 '10 at 21:53
  • @Steve Jessop: That wasn't entirely clear from the question (is N known at compile time?) but I was already adding a note on how to make this O(1) space complexity concurrent with you posting your comment. – Laurence Gonsalves Nov 30 '10 at 21:55
  • @MSN: No. You need to adjust the weights because the outputs are supposed to be unique. From the question: "output: K random unique Ids". If you flip a fair ("truly random") coin twice there's still a good chance that you'll get either two heads or two tails. – Laurence Gonsalves Nov 30 '10 at 21:59
  • @Laurence: but as Axn says, if you adjust the weights then you don't satisfy the criterion that each ID is output with frequency proportional to its weight (frequency measured over multiple trials, one assumes). The question's a mess, basically. – Steve Jessop Nov 30 '10 at 22:07
  • @Steve Jessop: I do. The weights are only adjusted for this one invocation of the function and the adjustment is only to ensure that the same id is not selected again in this call. There is definitely some imprecise language regarding what the weights mean, but I think it's still pretty clear what was intended: Each element in the output is selected from the inputs (that haven't already been selected for the outputs) with a probability proportional to its weight. – Laurence Gonsalves Nov 30 '10 at 22:20
  • @Steve Jessop: BTW: I've rewritten the answer to remove the non-constant storage version. – Laurence Gonsalves Nov 30 '10 at 22:21
  • Regarding "move to end of array" - if you're modifying all the weights after the selected item anyway, then just swap the selected item with the last unselected item (and update the latter's weight). Instead of subtracting `input[i].weight` from each succeeding weight, subtract `input[i].weight - input[max].weight`. – Steve Jessop Nov 30 '10 at 22:23
  • @Laurence: you may well be right about the meaning of the question. I'm a bit bored actually of people who pose interview questions and then disappear. In a real interview you can compensate for the interviewer's ambiguous or self-contradictory demands by letting them make a second attempt at asking the question ;-) – Steve Jessop Nov 30 '10 at 22:25
  • @Steve: Yeah, I'd considered swapping, but either way you end up having to modify the weights of all of the elements between i and n, and swapping is more complicated. I don't think your description is correct/complete -- you need to have 3 separate adjustments, one for the element that was at n-1, one for the selected element that was at i, and a third for all of the intervening elements. – Laurence Gonsalves Nov 30 '10 at 22:32
  • @Steve: actually, on re-reading the question, and your comments above, I think it's entirely possible that your interpretation was right as well. If that's the case, the problem is much simpler. – Laurence Gonsalves Nov 30 '10 at 22:35
  • @Laurence: Yes, I've skipped the details a bit. I didn't say *how* to update the weight of the item that started out at the end, and is moved to the middle. If it is at `n` and is about to move to `i`, set its weight to `(input[n].weight - input[n-1].weight) + input[i-1].weight`, with suitable bounds checking. And in my first formula above I've called it "weight", but it should be the original weight of the item, i.e. the consecutive difference at that point, not the partial sum currently stored as the `weight`. – Steve Jessop Nov 30 '10 at 22:38
  • @Laurence, well, then, this problem is just woefully underspecified. Ugh. – MSN Nov 30 '10 at 23:17
2

My short answer: you can't, strictly speaking.

That's because the problem definition is contradictory. As Axn brilliantly noticed:

There is a little bit of contradiction going on in the requirement. It states that K <= N. But as K approaches N, the frequency requirement will be contradicted by the Uniqueness requirement. Worst case, if K=N, all elements will be returned (i.e appear with same frequency), irrespective of their weight.

Anyway, when K is small relative to N, the calculated frequencies will be pretty close to the theoretical values.

The task may be split into two subtasks:

  1. Generate random numbers with a given distribution (specified by weights)
  2. Generate unique random numbers

Generate random numbers with a given distribution

  1. Calculate sum of weights (sumOfWeights)
  2. Generate random number from the range [1; sumOfWeights]
  3. Find an array element where the sum of weights from the beginning of the array is greater than or equal to the generated random number

Code

#include <iostream>
#include <cstdlib>
#include <ctime>

// 0 - id, 1 - weight
typedef unsigned Pair[2];

unsigned Random(Pair* i_set, unsigned* i_indexes, unsigned i_size)
{
   unsigned sumOfWeights = 0;
   for (unsigned i = 0; i < i_size; ++i)
   {
      const unsigned index = i_indexes[i];
      sumOfWeights += i_set[index][1]; // 1 - weight
   }

   const unsigned random = rand() % sumOfWeights + 1;

   sumOfWeights = 0;
   unsigned i = 0;
   for (; i < i_size; ++i)
   {
      const unsigned index = i_indexes[i];
      sumOfWeights += i_set[index][1]; // 1 - weight
      if (sumOfWeights >= random)
      {
         break;
      }
   }

   return i;
}

Generate unique random numbers

The well-known Durstenfeld variant of the Fisher-Yates shuffle may be used to generate unique random numbers. See this great explanation.

It requires O(N) extra space for the deck of indexes, so if the value of N is known at compile time, we are able to allocate the necessary space at compile time.

Now we have to combine these two algorithms: we just need to use our own Random() function instead of the standard rand() in the unique-number generation algorithm.

Code

template<unsigned N, unsigned K>
void Generate(Pair (&i_set)[N], unsigned (&o_res)[K])
{
   unsigned deck[N];
   for (unsigned i = 0; i < N; ++i)
   {
      deck[i] = i;
   }

   unsigned max = N - 1;

   for (unsigned i = 0; i < K; ++i)
   {
      const unsigned index = Random(i_set, deck, max + 1);

      std::swap(deck[max], deck[index]);
      o_res[i] = i_set[deck[max]][0];
      --max;
   }
}

Usage

int main()
{
   srand((unsigned)time(0));

   const unsigned c_N = 5;    // N
   const unsigned c_K = 2;    // K
   Pair input[c_N] = {{0, 5}, {1, 3}, {2, 2}, {3, 5}, {4, 4}}; // input array
   unsigned result[c_K] = {};

   const unsigned c_total = 1000000; // number of iterations
   unsigned counts[c_N] = {0};       // frequency counters

   for (unsigned i = 0; i < c_total; ++i)
   {
      Generate<c_N, c_K>(input, result);
      for (unsigned j = 0; j < c_K; ++j)
      {
         ++counts[result[j]];
      }
   }

   unsigned sumOfWeights = 0;
   for (unsigned i = 0; i < c_N; ++i)
   {
      sumOfWeights += input[i][1];
   }

   for (unsigned i = 0; i < c_N; ++i)
   {
      std::cout << (double)counts[i]/c_K/c_total  // empirical frequency
                << " | "
                << (double)input[i][1]/sumOfWeights // expected frequency
                << std::endl;
   }

   return 0;
}

Output

N = 5, K = 2

      Frequencies
Empirical   | Expected
 0.253813   | 0.263158
 0.16584    | 0.157895
 0.113878   | 0.105263
 0.253582   | 0.263158
 0.212888   | 0.210526

Corner case when weights are actually ignored

N = 5, K = 5

      Frequencies
Empirical   | Expected
 0.2        | 0.263158
 0.2        | 0.157895
 0.2        | 0.105263
 0.2        | 0.263158
 0.2        | 0.210526
Stas
  • If there was a reversal badge for merely mediocre questions instead of outright bad ones, this answer would deserve one. – StevenWilkins Dec 16 '10 at 01:30
1

I do assume that the ids in the output must be unique. This makes this problem a specific instance of random sampling problems.

The first approach that I can think of solves this in O(N^2) time, using O(N) memory (the input array itself plus constant extra memory). I assume that the weights are positive.

Let A be the array of pairs.

1) Set N to be A.length

2) Calculate the sum of all weights, W.

3) Loop K times

   3.1) r = a random number in [0, W)

   3.2) loop over A and find the smallest index i such that A[1].w + ... + A[i].w > r

   3.3) add A[i].id to output

   3.4) W = W - A[i].w (do this before overwriting A[i])

   3.5) A[i] = A[N-1] (or swap if the array contents should be preserved)

   3.6) N = N - 1
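
A compilable sketch of these steps in C++ (names are assumed; this uses the swap-removal, so the input's element order is not preserved):

#include <cstdlib>

struct Pair { unsigned id; unsigned w; };

// O(N^2)-time, O(1)-extra-space sketch of the steps above.
template <unsigned N, unsigned K>
void sample(Pair (&A)[N], unsigned (&output)[K])
{
    unsigned n = N;                        // step 1
    unsigned W = 0;                        // step 2
    for (unsigned i = 0; i < N; ++i) W += A[i].w;

    for (unsigned k = 0; k < K; ++k)       // step 3
    {
        const unsigned r = rand() % W;     // 3.1: r in [0, W)

        unsigned sum = 0, i = 0;           // 3.2: smallest i with running sum > r
        for (;; ++i)
        {
            sum += A[i].w;
            if (sum > r) break;
        }

        output[k] = A[i].id;               // 3.3
        W -= A[i].w;                       // 3.4: update W before overwriting A[i]
        A[i] = A[n - 1];                   // 3.5: swap-remove
        --n;                               // 3.6
    }
}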

Eyal Schneider