I need to write a fully reproducible Word2Vec test, and need to set PYTHONHASHSEED to a fixed value. This is my current setup:

# conftest.py
import pytest

@pytest.fixture(autouse=True)
def env_setup(monkeypatch):
    # Sets the variable for the already-running test process only
    monkeypatch.setenv("PYTHONHASHSEED", "123")

# test_w2v.py
import os

import numpy as np
from gensim.models import Word2Vec

def test_w2v():
    assert os.getenv("PYTHONHASHSEED") == "123"
    expected_words_embeddings = np.array(...)
    w2v = Word2Vec(my_tokenized_sentences, workers=1, seed=42, hashfxn=hash)
    # One vector per word, in sentence order (outer loop first, then inner)
    words_embeddings = np.array([w2v.wv.get_vector(word) for sentence in my_tokenized_sentences for word in sentence])
    np.testing.assert_array_equal(expected_words_embeddings, words_embeddings)

Here is the curious thing.

If I run the test from the terminal with `PYTHONHASHSEED=123 python3 -m pytest test_w2v.py`, the test passes without any issues. However, if I run the test from PyCharm (using pytest, set up from Edit Configurations -> Templates -> Python tests -> pytest), it fails. Most interestingly, it doesn't fail at `assert os.getenv("PYTHONHASHSEED") == "123"`, but at `np.testing.assert_array_equal(expected_words_embeddings, words_embeddings)`.

Why could this be the case, and is there a way to fix this issue?

andrea
  • According to the answer to [this question](https://stackoverflow.com/questions/30585108/disable-hash-randomization-from-within-python-program), it may be that the env var is set too late in PyCharm. One workaround would be to set the variable in the PyCharm startup script. – MrBean Bremen Mar 29 '20 at 12:18

1 Answer

You can't set PYTHONHASHSEED in Python code; it needs to be set before the Python interpreter starts, because that's the only time the interpreter consults it. That is why the `os.getenv` assertion passes while the embeddings still differ: the fixture changes the environment variable, but string hashing was already seeded when the process launched. You could set it globally before launching PyCharm, or there may be a PyCharm option to set environment variables for whatever execution environment you're triggering from PyCharm. (See, for example: How to set environment variables in PyCharm?)
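A minimal sketch of that behaviour, using only the standard library: changing `os.environ` inside a running interpreter leaves `hash()` untouched, while a fresh interpreter launched with the variable already set gives repeatable hashes. The helper name `hash_in_fresh_interpreter` is made up for illustration and isn't part of pytest, gensim, or PyCharm.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new interpreter started with PYTHONHASHSEED=seed.

    Hypothetical helper, for illustration only.
    """
    # Copy the current environment and set the seed *before* the child starts.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Setting the variable in the running process is too late:
# the string hash seed was fixed when this interpreter started.
hash_before = hash("word")
os.environ["PYTHONHASHSEED"] = "123"
hash_after = hash("word")
assert hash_before == hash_after

# Two fresh interpreters started with the same seed agree with each other.
first = hash_in_fresh_interpreter("word", 123)
second = hash_in_fresh_interpreter("word", 123)
assert first == second
```

This mirrors the terminal invocation in the question, where the variable is already in the environment when `python3` starts.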

More generally, though, you shouldn't be trying to make your gensim Word2Vec tests this deterministic.

If whatever you're testing is that sensitive to exact parameters – because only an exact seeding & (much slower) single-threaded training gets within your chosen tolerances, or gets an exact answer you copied from an earlier run – then you're not really verifying the algorithm's contributions under the sorts of real randomness that it is typically subject to. See more discussion in the gensim FAQ.
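One way to make a test tolerant of that jitter is to check a property of the vectors rather than their exact coordinates. The sketch below assumes a nearest-neighbour check is a suitable property for your data. The helper names and the toy vectors are made up for illustration, and in a real test the dictionaries would be filled from something like `w2v.wv.get_vector(word)`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def top_n_neighbors(vectors, word, n):
    """Return the n words whose vectors are most similar to `word` (hypothetical helper)."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:n]]


# Toy stand-ins for two training runs: coordinates differ completely,
# but the relative geometry (cat is close to dog, far from car) is the same.
run_a = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]}
run_b = {"cat": [0.2, 1.0], "dog": [0.3, 0.9], "car": [1.0, 0.1]}

for run in (run_a, run_b):
    assert "dog" in top_n_neighbors(run, "cat", n=1)
```

An exact `assert_array_equal` against `run_a` would fail on `run_b`, even though both runs encode the same relationships.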

gojomo
  • There are good reasons to fix randomness. If one is refactoring complex data science code (e.g. for production engineering), it's handy to be certain that code changes have precisely no functional impact. Also handy for reproducing or ruling out randomness when debugging intermittent production faults. – Michael Grazebrook Jan 25 '21 at 14:29
  • Sure, in some cases. But most of the people I see asking to do this with regard to `Word2Vec` on StackOverflow, the Gensim Github Issues, or the Gensim discussion list are instead trying to avoid reckoning with the inherent variability of this randomized algorithm (especially in the much-higher-throughput multithreaded version), & keep less-robust coding/testing practices that should be made more tolerant of 'jitter' in the results. – gojomo Jan 25 '21 at 17:51
  • Your point is valid - for the research phase. But once you get to production, techniques like these are very useful for support. Within code, you rarely need PYTHONHASHSEED since you can use random.seed(), numpy.random.seed() and tensorflow.random.set_seed(), depending on which libraries you're using. You can still get differences in some objects, for example if they get the current time. – Michael Grazebrook Jan 28 '21 at 00:29
  • In production, most will be using multiple threads because throughput will be important, which destroys reproducibility due to other thread-scheduling jitter. – gojomo Jan 28 '21 at 00:57