
I have a priority queue of "door numbers". I get the next door number from the priority queue (i.e. the door with the lowest corresponding priority value), and then open the door. Behind the door, there may be a gift or not. Based on the presence / absence of a gift, I update the priority for this door number, and put it back into the priority queue. I then repeat, getting the next door number to open, and so on.

Assuming every door has a different gift-replenishment rate (i.e. some may get a new gift daily, others never at all), how should I update the priority values in order to maximize the number of gifts I find? That is, I want to maximize the ratio of doors I open with gifts to doors I open without gifts.

I should note that replenishment rates are not guaranteed to be fixed over time / there is random variation. But I'm okay with simplifying assumptions here.

This almost seems like a Monte-Carlo problem to me, except that the more often I explore a node (door), the lower its expected value. (And of course there's no tree to build; we only need to figure out the value of depth-1 nodes.)

The most trivial way is to keep track of the last priority (LP) and current priority (CP), with delta = CP - LP. If we find a gift, set the next priority NP = CP + delta - 1; otherwise set NP = CP + delta + 1. This works, I suppose, but it converges rather slowly.

Or we could use a multiplicative update instead: NP = CP + delta * shrink or NP = CP + delta * grow, where shrink < 1 and grow > 1. This is what I currently have, and it seemed to work fine for months, but now some doors are being opened back-to-back (open door D, find a gift, put it back on the priority queue, D is immediately the best choice again, no gift is found of course, and it goes back on the queue with a worse priority), which seems pretty bad. For reference, I used shrink = 0.9 and grow = 1.3.
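For concreteness, here is a minimal sketch of that multiplicative update (function and variable names are mine; the priority-queue plumbing is omitted):

```python
SHRINK, GROW = 0.9, 1.3  # the factors mentioned above

def next_priority(last_priority, current_priority, found_gift):
    """Multiplicative update: shrink the revisit interval after a hit,
    grow it after a miss."""
    delta = current_priority - last_priority
    factor = SHRINK if found_gift else GROW
    return current_priority + delta * factor
```

So a door last visited at 0 and now at 10 gets rescheduled at roughly 19 after a hit and 23 after a miss.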

Is there a math formula (as with Monte-Carlo) expressing the optimal way to explore doors?

Philip H
  • Awesome question! This sounds almost exactly like a multi-armed bandit problem, except that multi-armed bandit problems most often don't have "memory", i.e. what you find behind a door doesn't depend on how long it's been since you last looked. – Stef Sep 01 '21 at 15:47
  • Yes, that's what I meant by Monte-Carlo problem, sorry if that was unclear (Monte Carlo tree search is based on the UCB1 algorithm for the multi-armed bandit problem: https://en.wikipedia.org/wiki/Monte_Carlo_tree_search#Exploration_and_exploitation). Similarly I am thinking of how to minimize regret... – Philip H Sep 01 '21 at 16:15
  • Interesting question! Is there an element of time in the rate of opening doors as well? Or is it assumed that the replenish rates are expressed in the speed of opening (e.g., door X replenishes ~every 4 openings)? – Vincent van der Weele Sep 01 '21 at 18:04
  • The doors are opened at a fixed rate (say D doors daily), so the replenish rates and priority values can just be expressed in those terms (as you said in your 2nd sentence) or just in terms of time units / days. The doors will continue to be opened at this rate, even if we expect nothing behind any of them, so we're not able to / trying to reduce the number of door visits, just maximize the number of successful visits. – Philip H Sep 01 '21 at 20:03
  • Applications / analogies: (1) Web crawler deciding which (known) webpage to visit where success = visiting a webpage that has been altered since we last saw it (i.e. updating our cache; visit is wasted if no changes found). (2) Porch pirate deciding which home to visit next where success = home with packages to steal outside (though this analogy only works if we assume people leave packages out forever until the pirate gets them, and all successes are of equivalent value). A webpage / house could get multiple updates / deliveries daily, but maybe we should lower-bound visits to daily. – Philip H Sep 01 '21 at 20:19

1 Answer


Multi-armed bandit theory runs deep and is not my specialty, so there's probably a reference that I don't know about. That being said, my first instinct is:

  • Simplify the math with the spherical-cow assumption that, for each door, the replenishment time is exponentially distributed with some unknown rate that stays constant over time.
  • Separate out our estimate of the replenishment rate from the history.
  • Set the priority of each door to 1 − exp(−λx) where λ is the estimated replenishment rate and x is the time since we last opened the door. (Higher is better.)
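Under those assumptions, the priority computation itself is a one-liner (a sketch; names are mine — and since the question's queue pops the *lowest* value, you'd push the negated priority, or use the threshold idea at the end of this answer):

```python
import math

def priority(est_rate, elapsed):
    """Probability that at least one gift has appeared after `elapsed`
    time units, if replenishment times are exponentially distributed
    with rate `est_rate`. Higher is better."""
    return 1.0 - math.exp(-est_rate * elapsed)
```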

Multi-armed bandits typically have to balance exploration with exploitation, but my hunch here is that we'll get this naturally from the replenishment process.

Most of the technical detail is in doing the estimate. We have a bunch of examples (x, b) where x is the time since we last opened the door and b is whether there was a gift. For a given rate λ, the formula above for the priority gives the expected value of b. I'll suggest a maximum likelihood estimator for λ. This means maximizing the sum of log(exp(−λx)) = −λx over all (x, 0) examples plus the sum of log(1 − exp(−λx)) over all (x, 1) examples. This function can be optimized directly, but there are two issues:

  • The more times we open a door, the more expensive the optimization gets.
  • If all of a door's examples are positive (or all negative), the solution is degenerate (λ → ∞ or λ → 0). Probably we should require λ to correspond to at least, say, one replenishment per month, to avoid giving up on a door entirely.

What I would actually recommend is picking a small set of λ values to make this a discrete optimization problem.
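A sketch of that discrete version, assuming the exponential model above (the candidate grid is illustrative, with the lower bound ~monthly to avoid the degenerate λ → 0 case; assumes every elapsed time x > 0):

```python
import math

def log_likelihood(rate, history):
    """history: list of (elapsed, found) pairs for one door.
    Sum of -rate*x over misses plus log(1 - exp(-rate*x)) over hits."""
    ll = 0.0
    for x, found in history:
        ll += math.log(1.0 - math.exp(-rate * x)) if found else -rate * x
    return ll

def estimate_rate(history, candidates=(1/30, 1/7, 1/3, 1.0, 3.0)):
    """Pick the candidate rate maximizing the likelihood."""
    return max(candidates, key=lambda r: log_likelihood(r, history))
```

A door that always has a gift waiting gets the highest candidate rate; a door that never does gets the lowest.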

(Another potential problem is that recomputing the priority formula for every door could get expensive when there are many doors. What you could do instead is pick a target priority threshold and calculate, for each door, when its priority will exceed that threshold.)
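That threshold trick can be sketched by inverting the priority formula: the priority 1 − exp(−λx) crosses a threshold p at x = −ln(1 − p)/λ, so you can schedule crossing times in a min-heap instead of rescoring every door (rates and threshold here are hypothetical):

```python
import math
import heapq

def crossing_time(rate, threshold=0.5):
    """Elapsed time x at which 1 - exp(-rate*x) reaches `threshold`."""
    return -math.log(1.0 - threshold) / rate

# Schedule each door at the moment its priority crosses the threshold.
now = 0.0
rates = {"A": 1.0, "B": 0.1}  # hypothetical estimated rates, per day
heap = [(now + crossing_time(r), door) for door, r in rates.items()]
heapq.heapify(heap)
when, door = heapq.heappop(heap)  # the faster-replenishing door pops first
```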

David Eisenstat