8

I am trying to get reproducible results with the genetic programming code in chapter 11 of "Programming Collective Intelligence" by Toby Segaran. However, simply setting seed "random.seed(55)" does not appear to work, changing the original code "from random import ...." to "import random" doesn't help, nor does changing Random(). These all seem to do approximately the same thing, the trees start out building the same, then diverge.

In reading various entries about the behavior of random, I can find no reason, given his GP code, why this divergence should happen. There doesn't appear to be anything in the code except calls to random, that has any variability that would account for this behavior. My understanding is that calling random.seed() should set all the calls correctly and since the code isn't threaded at all, I'm not sure how or why the divergence is happening.

Has anyone modified this code to behave reproducibly? Is there some form of calling random.seed() that may work better?

I apologize for not posting an example, but the code is obviously not mine (I'm adding only the call to seed and changing how random is called in the code) and this doesn't appear to be a simple issue with random (I've read all the entries on Python random here and many on the web in general).

Thanks. Mark L.

  • "... the trees start out building the same, then diverge." Is it possible that the code reseeds random elsewhere? – Steven Rumbalski Jul 07 '11 at 17:25
  • I suppose, but I cannot find where. The original code doesn't use seed, I added it myself (three different ways, no less). I am at a loss as to how it is reseeding and where I might change the code to fix that. All of the modifications I find in the code seem to come from random calls (random, choice, randint and randrange) within the various defs. Hopefully there is a way to make seed override whatever is causing this to happen. – Mark A. Lefebvre Jul 07 '11 at 17:34
  • Could any library you're using be calling random for any reason? – Thomas K Jul 07 '11 at 17:37
  • I think we'll need to see the code. – Steven Rumbalski Jul 07 '11 at 17:42
  • What kind of floating point issue would it be? The floats should all be the same based on the random seed, I would think. I've gone so far as to try the latest python 2.6 version as well, to no avail. – Mark A. Lefebvre Jul 07 '11 at 18:15
  • Yes, I had, but your idea is the only one that makes any sense! I checked all of the random calls individually to make sure they did the right thing and they all pass my tests too. – Mark A. Lefebvre Jul 07 '11 at 18:46
  • Have you tried using `multiprocessing` to run the code multiple times at once to see if it could be a timing-based thing? – JAB Jul 07 '11 at 20:42

3 Answers3

5

I had the same problem just now with some completely unrelated code. I believe my solution was similar to that in eryksun's answer, though I didn't have any trees. What I did have were some sets, and I was doing random.choice(list(set)) to pick values from them. Sometimes my results (the items picked) were diverging even with the same seed each time and I was close to pulling my hair out. After seeing eryksun's answer here I tried random.choice(sorted(set)) instead, and the problem appears to have disappeared. I don't know enough about the inner workings of Python to explain it.

tremby
  • 9,541
  • 4
  • 55
  • 74
  • 1
    You're issue would arise from the use of the `set`, not the `random` module. Sets are unordered, just like python dictionaries – Zizouz212 Aug 29 '16 at 21:11
  • 1
    I would have thought the order given that the elements are added in the same order would still be deterministic? If not, why not? Where is the random element introduced? – tremby Aug 29 '16 at 22:08
  • 1
    Think of the set as a list. Items in the set don't have a specific, constant position, and they can move around. This is because the items are hashed, which is also the mechanism Python uses to make all set items unique. This behaviour is seen in dictionaries also: the keys are not ordered because they are hashed. The reason why sorting the set works, is because now each item has a position, and it doesn't move. This position is the same at all times in the sorted container. The random function just generates a random number to index the container, and has no knowledge of the item's type. – Zizouz212 Aug 29 '16 at 22:17
  • Yes, I understand that. But the same program run twice, adding the same items to the set in the same order (I realize they are unordered, but my point is that I have taken the same steps) was then behaving differently each time, when the same RNG seeded the same way wanted to take a random value from that set. That is the part which I don't understand. Ordered or not, in those circumstances I still expected the same output. – tremby Aug 29 '16 at 22:36
  • 3
    Ah, I missed a crucial point. The internal hash function isn't a true "hash" of the data (like you'd expect from sha256 or md5). Instead, it's derived from the `id` function, which is basically the memory address of the object. So it will be different every time. You can see it. Create two string objects, and pass them through the `id` and `hash` functions, and they will be similar. This may also be of interest as to why they are different: http://stackoverflow.com/a/17192858/4293417 – Zizouz212 Aug 29 '16 at 22:42
  • 1
    That explains it. Thanks! This is potentially off-topic now, but do you know of any way to make that id predictable (seeded?) between runs of the program? – tremby Aug 29 '16 at 23:01
  • Controlling the memory address of the object is next to impossible. What you need is to have the objects in the same position for each run. For that, you'd need to have a container where objects retain a position (a container where objects aren't hashed). Dictionaries and sets are out of the question, and you need something that's ordered. Unless you need the functionality of a set, a `list` is probably your best bet, for built-ins. Otherwise, the [`collections`](https://docs.python.org/2/library/collections.html#module-collections) module will likely have what you need. – Zizouz212 Aug 29 '16 at 23:07
  • 1
    Okay, yeah, that's impractical. Thanks for solving the mystery! – tremby Aug 29 '16 at 23:28
  • 1
    In case anyone else gets this deep in the rabbit hole: there exists an [**`OrderedSet`**](https://pypi.org/project/ordered-set/) package. – danuker Nov 16 '20 at 11:00
3

This may help, to create a random object that won't be interfered with from elsewhere:

from random import Random
random = Random(55)
# random can be used like the plain module

If other libraries are calling random.seed for any reason, they won't affect the random object you've created for your program.

Thomas K
  • 39,200
  • 7
  • 84
  • 86
  • This does not appear to work either (I had tried something similar before), suffering the same basic problem. I am at a loss to explain this behavior. I happens on at least two versions of Python on Windows 7 and Linux. Seed appears to work for at least a little while. – Mark A. Lefebvre Jul 07 '11 at 18:53
3

I added the following function to gp.py, changing nothing else:

def set_seed(n):
  import random
  random.seed(n)

I'm using the module based on the example on page 267 (Google books). I can confirm that I get divergent results for the following trial:

>>> import gp
>>> gp.set_seed(55)
>>> rf = gp.getrankfunction(gp.buildhiddenset())
>>> gp.evolve(2, 500, rf, mutationrate=0.2, breedingrate=0.1, pexp=0.7, pnew=0.1)

It starts to diverge as early as the 4th value printed. I'm restarting the interpreter between trials, so it's not any prior state that's causing the problem.

Edit:

I found the random element. It's the memory address of the trees. The rank function sorts the list of results, for which each item is a tuple of the score and tree. Between runs the addresses change, and so the relative sort order of trees with equal score isn't a constant. Fortunately Python has a stable sort, so the fix is simple enough. Just use a sort key to sort based on only the score:

def getrankfunction(dataset):
  def rankfunction(population):
    scores=[(scorefunction(t, dataset), t) for t in population]
    scores.sort(key=lambda x: x[0])
    return scores
  return rankfunction
Eryk Sun
  • 33,190
  • 5
  • 92
  • 111
  • I have tried running it as outlined in the book on page 267 from within the interpreter. I can see where the stuff on page 260 might work, it seems to be either mutation or crossover (both?) that is being inconsistent (despite using only calls to random), that is added on page 265 and 266 (all code available at the O'Reilly website). – Mark A. Lefebvre Jul 07 '11 at 19:58
  • Thank goodness it is not just me! Any theories? It is bothering me that something this simple should be so obviously broken. I apologize for not posting the exact code above, I got sidetracked in the middle of typing and was then interrupted. Glad you were able to find it from the text. – Mark A. Lefebvre Jul 07 '11 at 20:30
  • This explains why I could not find it. Thank you for your help and that wonderful fix! – Mark A. Lefebvre Jul 08 '11 at 05:58