9

I have a website where users submit questions (zero, one or multiple per day), vote on them and answer one question per day (more details here). A user can see a question only once, either by submitting, voting on or answering it.

I have a pool of questions, and each player has already seen some of them. I need to remove 30 questions from the pool each month. I need to pick the questions to remove in such a way that I maximize the number of available questions left in the pool for the player with the fewest available questions.

Example with pool of 5 questions (and need to remove 3):

  • player A has seen questions 1, 3 and 5
  • player B has seen questions 1 and 4
  • player C has seen questions 2 and 4

I thought about removing the questions that the top player has seen, but then the worst-off player would change. Following the above example, player A has only got 2 questions left to play (2 and 4). However, if I remove 1, 3 and 5, the situation would be:

  • player A can play questions 2 and 4
  • player B can play question 2
  • player C cannot play anything, because 1, 3 and 5 are removed and he has already seen 2 and 4.

The score for this solution is zero, i.e. the player with the fewest available questions has zero questions left to play.

In this case it would be better to remove 1, 3 and 4, giving:

  • player A can play question 2
  • player B can play questions 2 and 5
  • player C can play question 5

The score for this solution is one: the two players with the fewest available questions each have one question left to play.

If the data size were small, I would be able to brute-force the solution. However, I have hundreds of players and questions, so I'm looking for an algorithm to solve this.
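For illustration, here is what a brute-force scoring of the toy example above could look like (a sketch of mine, not the site's code; questions are 0-indexed, so questions 1-5 become 0-4):

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <set>
#include <vector>

// Brute force over every way to remove k of `total` questions (a bitmask of
// removed ids). Returns the best achievable score: the maximum, over all
// removal sets, of the minimum number of playable questions any player has.
int best_score_brute_force(const std::vector<std::set<int>>& seen,
                           int total, int k) {
    int best = -1;
    for (unsigned mask = 0; mask < (1u << total); ++mask) {
        if (std::bitset<32>(mask).count() != (size_t)k) continue;  // not k removals
        int worst = total;  // minimum playable questions over all players
        for (const auto& s : seen) {
            int playable = 0;
            for (int q = 0; q < total; ++q)
                if (!(mask & (1u << q)) && !s.count(q)) ++playable;
            worst = std::min(worst, playable);
        }
        best = std::max(best, worst);
    }
    return best;
}
```

On the example (player A has seen {0, 2, 4}, B {0, 3}, C {1, 3}) this confirms the best achievable score with 3 removals is 1, but it clearly cannot scale to hundreds of questions.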

Milan Babuškov
  • Why removing questions at all? Why removing exactly 30 questions per month? – Dialecticus Jan 05 '12 at 18:58
  • There are two groups of player accounts: players that play for rankings and players who just play for fun. Players who play for rankings need to get unique questions, that's why the hiding is needed. Each month 30 questions are removed from ranked-players-pool into public-for-fun-pool. So, it is not really 30, but as many as there are days in a given month. – Milan Babuškov Jan 07 '12 at 22:38
  • Why not just removing the oldest question from ranked-players-pool? – Dialecticus Jan 08 '12 at 09:02
  • The fact that they are oldest does not mean that they have been seen by most of the players. In fact, the age of a question is completely unrelated to the problem. It could easily happen that a question added on day 30 is seen by all players on that day, while a question added on day 1 was only seen by 4-5 players. – Milan Babuškov Jan 09 '12 at 00:51
  • Why does it matter what question is seen by what number of players? All questions are going to be phased out eventually. It makes sense to do it the FIFO way. If it just so happens that some question is seen by very large number of players then it is probably a popular question, and a new player would probably want to answer it as well. – Dialecticus Jan 09 '12 at 10:10
  • There are no "popular" questions, they are assigned completely randomly from the pool. – Milan Babuškov Jan 10 '12 at 22:17
  • Then the probability that one question has significantly more answers than all others is very low (compensating for question's age), and not worth the effort to alleviate. But, if it's so much of a problem then why not just removing the question with most answers, instead of removing the oldest question? – Dialecticus Jan 11 '12 at 10:26
  • Answers are only a small part of the puzzle. What is important are views. If you look at my previous related SO question (link is above in text), you'll see that some players have seen a lot of questions while reviewing them. Take for example player A who has reviewed 40 questions and answered 30 questions during one month vs player B who joined the website recently and only played 15 questions this month with 0 reviews. Obviously, questions that player A has not yet seen are more valuable and should not be released from the pool, regardless of their age. – Milan Babuškov Jan 11 '12 at 14:35
  • I assumed that old questions are removed to make room for new questions. In that case there's no harm in removing any old question, because a new one is coming. Is new question available when old one is removed? What would be the reason for some question to be so "privileged" to stay in the pool longer than all the others? Maybe you should have another pool of top popular questions? – Dialecticus Jan 11 '12 at 16:11
  • Availability of new questions depends on whether users send them or not. It is not certain that we will have new questions when we remove some. Because of this, removal must be careful. As for the second part of your comment: all questions are treated equally. – Milan Babuškov Jan 11 '12 at 17:44
  • Are you removing a question every day or removing all thirty questions at the end of the month? – Kaganar Jan 13 '12 at 19:21
  • I suspect you would get a pretty good, fast and easy approximation by repeatedly removing a question answered by the player with the fewest free questions in the remaining set. Perhaps choosing the question with most answers in case of multiple available choices. – Thomas Ahle Jan 13 '12 at 23:06
  • Removing all 30 at end of month. – Milan Babuškov Jan 14 '12 at 08:18
  • In this case, you could limit your problem to only the users who have less than 60 questions left unseen. They are the only ones who are at risk of having zero questions left for that month if 30 questions are removed. Approximately how many total questions are we talking about? And how many total users? – Joel Cornett Jan 15 '12 at 01:56
  • Currently about 80 questions and 650 users. – Milan Babuškov Jan 15 '12 at 08:03
  • @JoelCornett, limiting problem to users with less than 60 questions is useless. But it will be a good optimization to limit the problem to users with difference of less than 30 from the user with minimal number of questions left. – Evgeny Kluev Jan 15 '12 at 16:41
  • @EvgenyKluev maybe I'm not seeing what you mean. As far as I can tell, 60 free questions is the minimum a user can have to ensure no possibility that a user will be left with less than 30 questions (1 per day) for the next 30 days. Therefore those with less than 60 questions are the only ones in danger of having a shortage of questions once the 30 questions are removed. They are the only group of interest. – Joel Cornett Jan 15 '12 at 22:02
  • @JoelCornett, there is no requirement to reserve exactly 30 questions for next month. Such a requirement is either too hard (because users may get lots of new questions in the very beginning of next month), or too soft (in case of no new questions for next 32 days). – Evgeny Kluev Jan 16 '12 at 09:23
  • Speaking about optimizations, it may be useful to filter out "unseen" questions, common to all users. Possibly, LP presolver will do this anyway. But for brute-force and heuristic algorithms it will help. – Evgeny Kluev Jan 16 '12 at 09:43

10 Answers

4

Let's suppose that you have a general efficient algorithm for this. Concentrate on the questions left, rather than the questions removed.

You could use such an algorithm to solve the problem - can you choose at most T questions such that every user has at least one question to answer? I think that this is http://en.wikipedia.org/wiki/Set_cover, and I think solving your problem in general allows you to solve set cover, so I think it is NP-complete.

There is at least a linear programming relaxation. Associate each question with a variable Qi in the range 0<= Qi <= 1. Choosing questions Qi such that each user has at least X questions available amounts to the constraint SUM Uij Qj >= X, which is linear in Qj and X, so you can maximise for the objective function X with the linear variables X and Qj. Unfortunately, the result need not give you integer Qj - consider for example the case when all possible pairs of questions are associated with some user and you want each user to be able to answer at least 1 question, using at most half of the questions. The optimum solution is Qi = 1/2 for all i.

(But given a linear programming relaxation you could use it as the bound in http://en.wikipedia.org/wiki/Branch_and_bound).

Alternatively you could just write down the problem and throw it at an integer linear programming package, if you have one handy.

mcdowella
  • I used GNU GLPK before (glpsol) and I'm comfortable writing problems in .cplex format, however I seem to be unable to write down the problem. Let's see where am I making a mistake. I believe Qi = 0 means question is removed, and Qi = 1 means question stays. But how do I write the maximize function? I want to maximize number of questions left for user that has least amount of questions. I don't know how to write this down as linear programming problem. – Milan Babuškov Jan 10 '12 at 22:20
  • Notice that I have defined an extra variable X. My objective function is just the value of X : Objective(X, Q1, Q2...) = X. X is the number of questions for the user with the least, and this is enforced by the constraint SUM_j Uij Qj >= X. This is one of the first tricks (section 6.2) in http://www.aimms.com/aimms/download/manuals/aimms3om_linearprogrammingtricks.pdf – mcdowella Jan 11 '12 at 05:04
  • The wikipedia article says it's been proved that Set Cover can't be approximated in polynomial time better than the simple greedy approach. Won't that also play in here and make an LP based approximation overkill? – Thomas Ahle Jan 14 '12 at 02:32
  • LP within branch and bound, or ILP, has the advantage over greedy set cover, if you can afford it, of providing a known best answer. When you are happy with a cheap approximate answer, a web search suggests, via http://web.njit.edu/~yang/research/covering/covering.pdf, that ILP may indeed be overkill - thanks. – mcdowella Jan 14 '12 at 06:28
  • Branch and Bound could utilize any approximation though. Assuming this is actually set-cover of course. LPR is a nice trick though :) – Thomas Ahle Jan 14 '12 at 12:20
2

For completeness of the thread, here is a simple greedy, approximating approach.

Place the solved questions in the previously discussed matrix form:

Q0    X
Q1  XX
Q2    X
Q3  X  
Q4   XX
    223

Sort by the number of questions solved:

Q0  X  
Q1   XX
Q2  X  
Q3    X
Q4  XX 
    322

Strike out a question with the most Xs among the players with the most problems solved. (This is guaranteed to decrease our measure, if anything is):

=======
Q1   XX
Q2  X  
Q3    X
Q4  XX 
    222

Sort again:

=======
Q1   XX
Q2  X  
Q3    X
Q4  XX 
    222

Strike again:

=======
=======
Q2  X  
Q3    X
Q4  XX 
    211

Sort again:

=======
=======
Q2  X  
Q3    X
Q4  XX 
    211

Strike again:

=======
=======
Q2  X  
Q3    X
=======
    101

It's O(n^2 log n) without optimizations, so it is plenty fast for some hundreds of questions. It's also easy to implement.
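A compact implementation of this strike loop (a sketch of mine, with 0-indexed question ids and one set of solved questions per player) might look like:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Greedy sketch of the procedure above: on each round, find the players with
// the fewest playable questions left, then strike the pool question that the
// largest number of those worst-off players have already solved.
std::vector<int> greedy_strike(const std::vector<std::set<int>>& solved,
                               int total, int strikes) {
    std::set<int> pool;
    for (int q = 0; q < total; ++q) pool.insert(q);
    std::vector<int> removed;
    for (int r = 0; r < strikes; ++r) {
        // Playable count per player over the current pool.
        std::vector<int> playable(solved.size(), 0);
        for (size_t p = 0; p < solved.size(); ++p)
            for (int q : pool)
                if (!solved[p].count(q)) ++playable[p];
        int worst = *std::min_element(playable.begin(), playable.end());
        // Strike the question with the most Xs among the worst-off players.
        int best_q = -1, best_cover = -1;
        for (int q : pool) {
            int cover = 0;
            for (size_t p = 0; p < solved.size(); ++p)
                if (playable[p] == worst && solved[p].count(q)) ++cover;
            if (cover > best_cover) { best_cover = cover; best_q = q; }
        }
        pool.erase(best_q);
        removed.push_back(best_q);
    }
    return removed;
}
```

On the question's example (A solved {0, 2, 4}, B {0, 3}, C {1, 3}) this removes three questions and leaves every player with one playable question, matching the optimal score of 1 there; ties are broken by lowest question id.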

It's not optimal, as can be seen from this counterexample with 2 strikes:

Q0 X     
Q1      X
Q2 XXX
Q3    XXX
Q4  XXXX
   222222

Here the greedy approach is going to remove Q4 and then Q2 (or Q3), instead of Q2 and Q3, which would be optimal for our measure: removing Q2 and Q3 leaves every player with 2 playable questions, while the greedy choice leaves the worst-off player with only 1.

Thomas Ahle
2

I propose a bunch of optimizations based on the idea that you really want to maximize the number of unseen questions for the player with the minimum number of questions, and do not care if there is 1 player with the minimum number of questions or 10000 players with that same number of questions.

Step 1: Find the player with the minimum number of questions unseen (In your example, that would be player A) Call this player p.

Step 2: Find all players within 30 of the number of questions unseen by player p. Call this set P. These are the only players who need to be considered, as removing 30 unseen questions from any other player would still leave them with more unseen questions than player p, and thus player p would still be worse off.

Step 3: Find the intersection of all sets of problems seen by players in P. You may remove all problems in this set, hopefully dropping you down from 30 to some smaller number of problems to remove, which we will call r (r <= 30).

Step 4: Find the union of all sets of problems seen by players in P. Call this set U. If the size of U is <= r, you are done: remove all problems in U, then remove the remaining problems arbitrarily from your set of all problems. Player p will lose r - size(U) of them and remain with the fewest unseen problems, but this is the best you can do.

You are now left with your original problem, but likely with vastly smaller sets.
Your problem set is U, your player set is P, and you must remove r problems.
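The reduction steps above can be sketched as follows (my own sketch; the struct and names P, common, U mirror the text, they are not from the answer's code):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Sketch of steps 1-4 above. seen[i] is the set of problems player i has
// seen, total is the pool size, k the number of removals (30 in the question).
struct Reduction {
    std::vector<int> P;    // players within k of the minimum (steps 1-2)
    std::set<int> common;  // intersection of their seen sets (step 3)
    std::set<int> U;       // union of their seen sets (step 4)
};

Reduction reduce(const std::vector<std::set<int>>& seen, int total, int k) {
    // Step 1: the minimum number of unseen problems over all players.
    int min_unseen = total;
    for (const auto& s : seen)
        min_unseen = std::min(min_unseen, total - (int)s.size());
    Reduction r;
    // Step 2: keep only players within k of that minimum.
    for (size_t i = 0; i < seen.size(); ++i)
        if (total - (int)seen[i].size() < min_unseen + k)
            r.P.push_back((int)i);
    // Steps 3 and 4: intersection and union over the kept players.
    bool first = true;
    for (int i : r.P) {
        r.U.insert(seen[i].begin(), seen[i].end());
        if (first) { r.common = seen[i]; first = false; }
        else {
            std::set<int> tmp;
            std::set_intersection(r.common.begin(), r.common.end(),
                                  seen[i].begin(), seen[i].end(),
                                  std::inserter(tmp, tmp.begin()));
            r.common.swap(tmp);
        }
    }
    return r;
}
```

On the question's toy example all three players survive the filter (they are all within 3 of the minimum), the intersection is empty, and U is the whole pool, so no reduction happens there; with hundreds of players of very different activity levels the shrinkage should be much larger.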

The brute force approach takes time (size(U) choose r) * size (P). If those numbers are reasonable, you can just brute force it. This approach is to choose each set of r problems from U and evaluate it against all players in P.

Since your problem does appear to be NP-Complete, the best you can probably hope for is an approximation. The easiest way to do this is to set some max number of tries, then randomly choose and evaluate sets of problems to remove. As such, a function to perform U choose r randomly becomes necessary. This can be done in time O(r), (In fact, I answered how to do this earlier today!)

Select N random elements from a List<T> in C#

You can also put any of the heuristics suggested by other users into your choices by weighting each problem's chance to be selected, I believe the link above shows how to do that in the selected answer.

dhakim
1

Let's say you want to delete Y questions from the pool. The simple algorithm would be to sort questions by the number of views they have had, then remove the Y most-viewed ones. For your example: 1: 2, 2: 1, 3: 1, 4: 2, 5: 1. Clearly, you're better off removing questions 1 and 4. But this algorithm doesn't achieve the goal on its own. However, it is a good starting point. To improve it, you need to make sure that every user will end up with at least X questions after the "cleaning".

In addition to the above array (which we can call "score"), you need a second one with questions and users, where each cell is 1 if the user has seen the question and 0 if he hasn't. Then, for every user you need to find the X questions with the lowest score (edit: that he hasn't seen yet). The lower their score the better, since the fewer people who saw a question, the more "valuable" it is for the system overall. You combine all the X questions found for every user into a third array, let's call it "safe", since we won't delete anything from it.

As the last step you just delete Y top viewed questions (the ones with the highest score), which aren't in the "safe" array.

What this algorithm also achieves is that if deleting, say, 30 questions would leave some users with fewer than X questions to view, it won't remove all 30. Which is, I guess, good for the system.
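My reading of this heuristic, as a sketch (all function and variable names are mine; X and Y as in the text):

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Sketch of the heuristic above: build the "score" array (views per
// question), protect each user's X least-viewed unseen questions in a
// "safe" set, then delete up to Y of the most-viewed unsafe questions.
std::vector<int> remove_with_safe_set(const std::vector<std::set<int>>& seen,
                                      int total, int X, int Y) {
    std::vector<int> score(total, 0);
    for (const auto& s : seen)
        for (int q : s) ++score[q];

    std::set<int> safe;
    for (const auto& s : seen) {
        std::vector<int> unseen;
        for (int q = 0; q < total; ++q)
            if (!s.count(q)) unseen.push_back(q);
        std::stable_sort(unseen.begin(), unseen.end(),
                         [&](int a, int b) { return score[a] < score[b]; });
        for (int i = 0; i < X && i < (int)unseen.size(); ++i)
            safe.insert(unseen[i]);
    }

    // Delete up to Y of the most-viewed questions that are not safe.
    std::vector<int> order(total);
    for (int q = 0; q < total; ++q) order[q] = q;
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return score[a] > score[b]; });
    std::vector<int> removed;
    for (int q : order)
        if (!safe.count(q) && (int)removed.size() < Y) removed.push_back(q);
    return removed;
}
```

On the question's example (0-indexed, X = 1, Y = 3) this removes questions 1, 4 and 5 and leaves every user at least one playable question, i.e. the optimal score of 1; note that fewer than Y questions get removed when the safe set blocks the rest.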

Edit: A good optimization for this would be to track not every user, but to have some activity benchmark to filter out people who have seen only a few questions. If there are too many people who have each seen only one rare question, then nothing can be deleted. Filtering these kinds of users, or improving the "safe" array functionality, can solve that.

Feel free to ask questions if I didn't describe the idea deep enough.

Ranty
  • Interesting idea. Let's say that Y is 30, how do I determine value for X? – Milan Babuškov Jan 10 '12 at 21:00
  • You specify `X`, not determine. For example, you could start with `X = 1` and see how it turns out, then by running algorithm for `X = 2, 3, 4` you will see how fast the amount of questions you can remove drops. As a side note, you really need to filter "not-very-active" users for this idea to work. – Ranty Jan 10 '12 at 21:29
1

Have you considered viewing this in terms of a dynamic programming solution?

I think you might be able to do it by maximizing the number of available questions left open to all players, such that no single player is left with zero open questions.

The following link provides a good overview of how to construct dynamic programming solutions to these sorts of problems.

NealB
1

Presenting this in terms of questions still playable. I'll number the questions from 0 to 4 instead of 1 to 5, as this is more convenient in programming.

          01234
          -----
player A   x x   - player A has just 2 playable questions
player B   xx x  - player B has 3 playable questions
player C  x x x  - player C has 3 playable questions

I'll first describe what might appear to be a very naive algorithm, but at the end I'll show how it can be improved significantly.

For each of the 5 questions, you'll need to decide whether to keep it or discard it. This will require a recursive function that will have a depth of 5.

vector<bool> keep_or_discard(5); // an array to store the five decisions

void decide_one_question(int question_id) {
    // first, pretend we keep the question
    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1); // recursively consider the next question

    // then, pretend we discard this question
    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1); // recursively consider the next question
}

decide_one_question(0); // this call starts the whole recursive search

This first attempt will fall into an infinite recursive descent and run past the end of the array. The obvious first thing we need to do is to return immediately when question_id == 5 (i.e. when all questions 0 to 4 have been decided). We add this code to the beginning of decide_one_question:

void decide_one_question(int question_id) {
    {
        if(question_id == 5) {
            // no more decisions needed.
            return;
        }
    }
    // ....

Next, we know how many questions we are allowed to keep. Call this allowed_to_keep. This is 5-3 in this case, meaning we are to keep exactly two questions. You might set this as a global variable somewhere.

int allowed_to_keep; // set this to 2

Now, we must add further checks to the beginning of decide_one_question, and add another parameter:

void decide_one_question(int question_id, int questions_kept_so_far) {
    {
        if(question_id == 5) {
            // no more decisions needed.
            return;
        }
        if(questions_kept_so_far > allowed_to_keep) {
            // not allowed to keep this many, just return immediately
            return;
        }
        int questions_left_to_consider = 5 - question_id; // how many not yet considered
        if(questions_kept_so_far + questions_left_to_consider < allowed_to_keep) {
            // even if we keep all the rest, we'll fall short
            // may as well return. (This is an optional extra)
            return;
        }
    }

    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1, questions_kept_so_far + 1);

    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1, questions_kept_so_far );
}

decide_one_question(0,0);

( Notice the general pattern here: we allow the recursive function call to go one level 'too deep'. I find it easier to check for 'invalid' states at the start of the function, than to attempt to avoid making invalid function calls in the first place. )

So far, this looks quite naive. This is checking every single combination. Bear with me!

We need to start keeping track of the score, in order to remember the best (and in preparation for a later optimization). The first thing would be to write a function calculate_score. And to have a global called best_score_so_far. Our goal is to maximize it, so this should be initialized to -1 at the start of the algorithm.

int best_score_so_far; // initialize to -1 at the start

void decide_one_question(int question_id, int questions_kept_so_far) {
    {
        if(question_id == 5) {
            int score = calculate_score();
            if(score > best_score_so_far) {
                // Great!
                best_score_so_far = score;
                store_this_good_set_of_answers();
            }
            return;
        }
        // ...

Next, it would be better to keep track of how the score is changing as we recurse through the levels. Let's start off by being optimistic: let's pretend we can keep every question, calculate the score, and call it upper_bound_on_the_score. A copy of this will be passed into the function every time it calls itself recursively, and it will be updated locally every time a decision is made to discard a question.

void decide_one_question(int question_id
                       , int questions_kept_so_far
                       , int upper_bound_on_the_score) {
     ... the checks we've already detailed above

    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1
          , questions_kept_so_far + 1
          , upper_bound_on_the_score
        );

    keep_or_discard[question_id] = false;

    decide_one_question(question_id + 1
          , questions_kept_so_far
          , calculate_the_new_upper_bound()
        );

See near the end of that last code snippet, that a new (smaller) upper bound has been calculated, based on the decision to discard question 'question_id'.

At each level in the recursion, this upper bound be getting smaller. Each recursive call either keeps the question (making no change to this optimistic bound), or else it decides to discard one question (leading to a smaller bound in this part of the recursive search).

The optimization

Now that we know an upper bound, we can have the following check at the very start of the function, regardless of how many questions have been decided at this point:

void decide_one_question(int question_id
                       , int questions_kept_so_far
                       , int upper_bound_on_the_score) {
        if(upper_bound_on_the_score < best_score_so_far) {
            // the upper bound is already too low,
            // therefore, this is a dead end.
            return;
        }
        if(question_id == 5) // .. continue with the rest of the function.

This check ensures that once a 'reasonable' solution has been found, the algorithm will quickly abandon all the 'dead end' searches. It will then (hopefully) quickly find better and better solutions, and it can then be even more aggressive in pruning dead branches. I have found that this approach works quite nicely for me in practice.

If it doesn't work, there are many avenues for further optimization. I won't try to list them all, and you could certainly try entirely different approaches. But I have found this to work on the rare occasions when I have to do some sort of search like this.

Aaron McDaid
1

Here's an integer program. Let constant unseen(i, j) be 1 if player i has not seen question j and 0 otherwise. Let variable kept(j) be 1 if question j is to be kept and 0 otherwise. Let variable score be the objective.

maximize score                                       # score is your objective
subject to

for all i,  score <= sum_j (unseen(i, j) * kept(j))  # score is at most
                                                     # the number of questions
                                                     # available to player i

sum_j (1 - kept(j)) = 30                             # remove exactly
                                                     # 30 questions

for all j,  kept(j) in {0, 1}                        # each question is kept
                                                     # or not kept (binary)

(score has no preset bound; the optimal solution chooses score
 to be the minimum over all players of the number of questions
 available to that player)
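For the five-question example from the question, this model could be written out in CPLEX LP format (the format the asker mentions using with glpsol) roughly as follows; k1..k5 play the role of kept(j), and the coefficients come from the unseen(i, j) matrix:

```
Maximize
 obj: score
Subject To
 playerA: k2 + k4 - score >= 0
 playerB: k2 + k3 + k5 - score >= 0
 playerC: k1 + k3 + k5 - score >= 0
 keep:    k1 + k2 + k3 + k4 + k5 = 2
Binary
 k1 k2 k3 k4 k5
General
 score
End
```

Solving this should give an optimal score of 1 (for instance by keeping questions 2 and 5, i.e. removing 1, 3 and 4 as worked out by hand in the question).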
scowl
1

Linear programming models.

Variant 1.

Sum(Uij * Qj) - Sum(Dij * Xj) + 0     =  0 (for each i)
0             + Sum(Dij * Xj) - Score >= 0 (for each i)
Sum(Qj) = (Number of questions - 30)
Maximize(Score)

Uij is 1 if user i has not seen question j, otherwise it is 0

Dij is element of identity matrix (Dij=1 if i=j, otherwise it is 0)

Xj is auxiliary variable (one for each user)

Variant 2.

Sum(Uij * Qj) >= Score (for each i)
Sum(Qj) = (Number of questions - 30)
No objective function, just check feasibility

In this case, the LP problem is simpler, but Score must be determined by binary and linear search. Set the current range to [0 .. the least number of unseen questions over all users], set Score to the middle of the range, and apply the integer LP algorithm (with a small time limit). If no solution is found, set the range to [begin .. Score-1]; otherwise set it to [Score .. end] and continue the binary search.

(Optionally) use binary search to determine upper bound for exact solution's Score.

Starting from the best Score found by binary search, apply the integer LP algorithm with Score increased by 1, 2, ... (limiting computation time as necessary). At the end, you get either an exact solution or a good approximation.
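The search loop described above might look like this (a sketch; `feasible` is a hypothetical stand-in callback for "run the variant 2 integer LP with this Score and a small time limit", not an actual GLPK call):

```cpp
#include <cassert>
#include <functional>

// Binary search for the largest s in [lo, hi] for which feasible(s) holds,
// assuming monotonicity: if a score s is achievable, so is every lower score.
int binary_search_score(int lo, int hi,
                        const std::function<bool(int)>& feasible) {
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;  // round up so the loop terminates
        if (feasible(mid)) lo = mid;       // solution found: search upper half
        else hi = mid - 1;                 // infeasible: search lower half
    }
    return lo;
}
```

Note the caveat from the text: a time-limited solver that finds no solution has not proved infeasibility, so in practice this returns a lower bound that the subsequent linear search (Score + 1, + 2, ...) tries to improve.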

Here is sample code in C for GNU GLPK (for variant 1):

#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
  int ind[3000];
  double val[3000];
  int row;
  int col;
  glp_prob *lp;

  // Parameters
  int users = 120;
  int questions = 10000;
  int questions2 = questions - 30;
  int time = 30; // sec.

  // Create GLPK problem
  lp = glp_create_prob();
  glp_set_prob_name(lp, "questions");
  glp_set_obj_dir(lp, GLP_MAX);

  // Configure rows
  glp_add_rows(lp, users*2 + 1);
  for (row = 1; row <= users; ++row)
  {
    glp_set_row_bnds(lp, row, GLP_FX, 0.0, 0.0);
    glp_set_row_bnds(lp, row + users, GLP_LO, 0.0, 0.0);
  }
  glp_set_row_bnds(lp, users*2 + 1, GLP_FX, questions2, questions2);

  // Configure columns
  glp_add_cols(lp, questions + users + 1);
  for (col = 1; col <= questions; ++col)
  {
    glp_set_obj_coef(lp, col, 0.0);
    glp_set_col_kind(lp, col, GLP_BV);
  }
  for (col = 1; col <= users; ++col)
  {
    glp_set_obj_coef(lp, questions + col, 0.0);
    glp_set_col_kind(lp, questions + col, GLP_IV);
    glp_set_col_bnds(lp, questions + col, GLP_FR, 0.0, 0.0);
  }
  glp_set_obj_coef(lp, questions+users+1, 1.0);
  glp_set_col_kind(lp, questions+users+1, GLP_IV);
  glp_set_col_bnds(lp, questions+users+1, GLP_FR, 0.0, 0.0);

  // Configure matrix (question columns)
  for(col = 1; col <= questions; ++col)
  {
    for (row = 1; row <= users*2; ++row)
    {
      ind[row] = row;
      val[row] = ((row <= users) && (rand() % 2))? 1.0: 0.0;
    }
    ind[users*2 + 1] = users*2 + 1;
    val[users*2 + 1] = 1.0;
    glp_set_mat_col(lp, col, users*2 + 1, ind, val);
  }

  // Configure matrix (user columns)
  for(col = 1; col <= users; ++col)
  {
    for (row = 1; row <= users*2; ++row)
    {
      ind[row] = row;
      val[row] = (row == col)? -1.0: ((row == col + users)? 1.0: 0.0);
    }
    ind[users*2 + 1] = users*2 + 1;
    val[users*2 + 1] = 0.0;
    glp_set_mat_col(lp, questions + col, users*2 + 1, ind, val);
  }

  // Configure matrix (score column)
  for (row = 1; row <= users*2; ++row)
  {
    ind[row] = row;
    val[row] = (row > users)? -1.0: 0.0;
  }
  ind[users*2 + 1] = users*2 + 1;
  val[users*2 + 1] = 0.0;
  glp_set_mat_col(lp, questions + users + 1, users*2 + 1, ind, val);

  // Solve integer GLPK problem
  glp_iocp param;
  glp_init_iocp(&param);
  param.presolve = GLP_ON;
  param.tm_lim = time * 1000;
  glp_intopt(lp, &param);
  printf("Score = %g\n", glp_mip_obj_val(lp));

  glp_delete_prob(lp);
  return 0;
}

Time limit is not working reliably in my tests. Looks like some bug in GLPK...

Sample code for variant 2 (only LP algorithm, no automatic search for Score):

#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
  int ind[3000];
  double val[3000];
  int row;
  int col;
  glp_prob *lp;

  // Parameters
  int users = 120;
  int questions = 10000;
  int questions2 = questions - 30;
  double score = 4869.0 + 7;

  // Create GLPK problem
  lp = glp_create_prob();
  glp_set_prob_name(lp, "questions");
  glp_set_obj_dir(lp, GLP_MAX);

  // Configure rows
  glp_add_rows(lp, users + 1);
  for (row = 1; row <= users; ++row)
  {
    glp_set_row_bnds(lp, row, GLP_LO, score, score);
  }
  glp_set_row_bnds(lp, users + 1, GLP_FX, questions2, questions2);

  // Configure columns
  glp_add_cols(lp, questions);
  for (col = 1; col <= questions; ++col)
  {
    glp_set_obj_coef(lp, col, 0.0);
    glp_set_col_kind(lp, col, GLP_BV);
  }

  // Configure matrix (question columns)
  for(col = 1; col <= questions; ++col)
  {
    for (row = 1; row <= users; ++row)
    {
      ind[row] = row;
      val[row] = (rand() % 2)? 1.0: 0.0;
    }
    ind[users + 1] = users + 1;
    val[users + 1] = 1.0;
    glp_set_mat_col(lp, col, users + 1, ind, val);
  }

  // Solve integer GLPK problem
  glp_iocp param;
  glp_init_iocp(&param);
  param.presolve = GLP_ON;
  glp_intopt(lp, &param);

  glp_delete_prob(lp);
  return 0;
}

It appears that variant 2 allows finding a pretty good approximation quite fast, and the approximation is better than for variant 1.

Evgeny Kluev
  • As I understand, I should only replace rand() with 0 or 1 depending if user has seen the question or not. The inner for() loop goes up to users*2 so I'm not sure what to place there. Supposing I had a function that returns 1 or 0 like this: int user_seen_question(int user, int question), how would I replace rand() with this function? – Milan Babuškov Jan 16 '12 at 13:09
  • Question for variant 2: where does 4869.0 come from? – Milan Babuškov Jan 16 '12 at 13:19
  • Replace `(rand() % 2)? 1.0: 0.0` to `user_seen_question(int user, int question)? 0.0: 1.0`. Also you may replace `questions2` and (for variant 2) `score`. All other values are `ones` and `zeros`, used for identity matrix, ignored variables, and questions counting, do not change them. Actually all the matrix coefficients in the samples are exactly in the same order as in model descriptions at the top of my answer. – Evgeny Kluev Jan 16 '12 at 14:23
  • 4869.0 and 4869.0+7 are results of binary search, which I did manually. 4869.0 is the best score approximated. 4869.0+7 is the lowest score, which definitely makes the problem infeasible. The exact value is somewhere in-between. If you have different RNG implementation, you'll get different values for score. – Evgeny Kluev Jan 16 '12 at 14:23
  • In your examples, variables are "row" and "col", not "user" and "question". In variant2 it is easy, instead of rand() I write user_seen_question(row, col). But, what to write in variant 1 when row can be 2*users? – Milan Babuškov Jan 16 '12 at 15:43
  • Does this score 4869 depend on number of users and questions or it is fixed value? – Milan Babuškov Jan 16 '12 at 15:44
  • For the first half of rows you write `!user_seen_question()`, as usual. Second half of rows have zeros in these columns. In other words, change `((row <= users) && (rand() % 2))? 1.0: 0.0;` to `((row <= users) && !user_seen_question(user, question))? 1.0: 0.0;`. – Evgeny Kluev Jan 16 '12 at 16:10
  • Yes, score 4869 depends on any input data, on number of users, questions, removed questions, on user_seen_question(). You can obtain this value yourself, starting with least-unseen-questions and doing binary search. For this sample you'll have to write about log(n) different constants in this place, recompile and run for each of them. In real program this should be done automatically. – Evgeny Kluev Jan 16 '12 at 16:10
  • Thanks. I will award the bounty now because the time limit is reached, but I might have some more questions when I try to implement it during the next few days. – Milan Babuškov Jan 16 '12 at 17:50
  • Feel free to ask anything. Just a couple of advices. @dhakim proposes several optimization steps. Try them. May be GLPK presolver already does all this. May be not. For (variant 2) automatic score search, it may be more convenient (but less portable) to start a separate process for each search step. And kill unfinished processes after some timeout. Since having 2+ core processor is quite usual these days, you can start 2+ LP solver processes at a time to get ternary+ search instead of binary one and to perform several linear search steps at once. – Evgeny Kluev Jan 16 '12 at 18:25
  • In the end I used variant 1. It calculates the solution for my problem in about 30 seconds, which is quite fine since I only run it once per month. Thanks again. – Milan Babuškov Jan 31 '12 at 00:18
  • Don't forget to test is with larger problem sizes. Integer GLPK solver uses branch-and-cut algorithm, which has exponential complexity. – Evgeny Kluev Jan 31 '12 at 08:56
  • 30 seconds is the time needed to solve the problem in production: 144 questions and 657 users for January. – Milan Babuškov Feb 02 '12 at 11:31
  • But several months later it may be 500 questions / 800 users. Or more. – Evgeny Kluev Feb 02 '12 at 11:51
  • Users are in the range from 600-700 for over a year. If there were 500 questions I would be very happy and easily just pick any 30 from the pool as picking would not be so relevant anymore. The fact is that 4-5 passionate users have seen much more questions than the rest of the crowd, so if I had 500 questions I would probably only calculate model for those 5 users. The reason why I need precise calculation is because there are not many questions in the pool. – Milan Babuškov Mar 07 '12 at 12:58
0

If there are too many options to brute force and there are likely many near-optimal solutions (which sounds to be the case), consider Monte Carlo methods.

You have a clearly defined fitness function, so just make some random assignments and score the result. Rinse and repeat until you run out of time or some other criterion is met.
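A minimal sketch of that loop (my own; the fitness function is the question's score measure, i.e. the minimum number of playable questions any player has left):

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <set>
#include <vector>

// Fitness: minimum over players of the number of pool questions that are
// neither removed nor already seen by that player.
int fitness(const std::vector<std::set<int>>& seen, int total,
            const std::vector<int>& removed) {
    int worst = total;
    for (const auto& s : seen) {
        int unseen = 0;
        for (int q = 0; q < total; ++q)
            if (!s.count(q) &&
                std::find(removed.begin(), removed.end(), q) == removed.end())
                ++unseen;
        worst = std::min(worst, unseen);
    }
    return worst;
}

// Monte Carlo: sample random sets of k removals, keep the best one found.
std::vector<int> monte_carlo(const std::vector<std::set<int>>& seen, int total,
                             int k, int tries, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::vector<int> all(total), best_removed;
    for (int q = 0; q < total; ++q) all[q] = q;
    int best = -1;
    for (int t = 0; t < tries; ++t) {
        std::shuffle(all.begin(), all.end(), rng);
        std::vector<int> removed(all.begin(), all.begin() + k);
        int f = fitness(seen, total, removed);
        if (f > best) { best = f; best_removed = removed; }
    }
    return best_removed;
}
```

On the question's tiny example a few hundred tries reliably hit an optimal removal (score 1); weighting the sampling by the heuristics from the other answers would speed this up on real data.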

Fantius
0

The question seems easy at first, but after thinking about it more deeply you realize how hard it is.

The simplest option would be removing the questions that have been seen by the maximum number of users, but this does not take the number of remaining questions for each user into consideration. Too few questions may be left for some users after the removal.

A more complex solution would be computing the number of remaining questions for each user after deleting a question. You need to compute this for every question and every user, which may be time-consuming if you have many users and questions. Then you can sum up the number of questions left over all users, and select for deletion the question with the highest sum.

I think it would be wise to cap the number of remaining questions for a user at a reasonable value. You can think: "OK, this user has enough questions to view if he has more than X questions." You need this because after deleting a question, only 15 questions may be left for an active user while 500 may be left for a rarely-visiting user. It's not fair to sum 15 and 500; you can instead define a threshold value of 100.

To make it easier to compute, you can consider only the users who have viewed more than X questions.

mustafa