
I'm trying to store data in a global variable inside a Redis Queue (RQ) worker so that this data remains pre-loaded, i.e. it doesn't need to be loaded for every RQ job.

Specifically, I'm working with Word2Vec vectors and loading them using gensim's KeyedVectors.

My app is in Python Flask, running on a Linux server, containerized using Docker.

My goal is to reduce processing time by keeping a handful of large vectors files loaded in memory at all times.

I first tried storing them in global variables in Flask, but then each of my 8 gunicorn workers loads the vectors, which eats up a lot of RAM.

I only need one worker to store a particular vectors file.

I've been told that one solution is to have a set number of RQ workers holding the vectors in a global variable, so that I can control which workers get which vectors files loaded in.

Here is what I have so far:

RQ_worker.py

from rq import Worker, Connection
from gensim.models.keyedvectors import KeyedVectors
from my_common_methods import get_redis

# Loaded once at module level, so every job handled by this worker can reuse it
W2V = KeyedVectors.load_word2vec_format('some_path/vectors.bin', binary=True)

def rq_task(some_args):
    # use some_args and W2V to do some processing, e.g. write one vector out
    # (here some_args is treated as an output file path):
    with open(some_args, 'w') as f_out:
        f_out.write(str(W2V['word']))

if __name__ == '__main__':
    with Connection(get_redis()):
        worker = Worker(['default'])
        worker.work()

app.py

from flask import Flask, request
from rq import Queue, Connection
from RQ_worker import rq_task
from my_common_methods import get_redis

app = Flask(__name__)

@app.route("/someroute", methods=['POST'])
def some_route():
    # test Redis Queue: hand the request payload off to a worker
    some_args = request.get_json()  # e.g. a path string for the worker to write its output to
    with Connection(get_redis()):
        q = Queue()
        task = q.enqueue(rq_task, some_args)
    return 'queued: %s' % task.get_id(), 202

docker-stack.yml

version: '3.7'

services:
  nginx:
    image: nginx:mainline-alpine
    deploy: ...
    configs: ...
    networks: ...

  flask:
    image: ...
    deploy: ...
    environment: ...
    networks: ...
    volumes: ...

  worker:
    image: ...
    command: python2.7 RQ_worker.py
    deploy:
      replicas: 1
    networks: ...
    volumes:
      - /some_path/data:/some_path/data

configs:
  nginx.conf:
    external: true
    name: nginx.conf

networks:
  external:
    external: true
  database:
    external: true

(I redacted a bunch of stuff from Docker, but can provide more details, if relevant.)

The above generally works, except that the RQ worker seems to load W2V from scratch each time it gets a new job, which defeats the whole purpose. It should keep the vectors stored in W2V as a global variable, so they don't need to be reloaded each time.

Am I missing something? Should I set it up differently?

I've been told that it might be possible to use mmap to load the vectors file into a global variable that the RQ worker sits on, but I'm not sure how that would work with KeyedVectors.

Any advice would be much appreciated!

lgc

1 Answer


If you use load_word2vec_format(), the code will always re-parse the (not-native-to-gensim-or-Python) word-vectors format and allocate new objects/memory to store the results.

You can instead use gensim's native .save() to store the vectors in a friendlier format for later native .load() operations. Large arrays of vectors will be stored in separate, memory-map-ready files. Then, when you .load(..., mmap='r') those files, even multiple times from different threads or processes within the same container, they'll share the same RAM.
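For example, a rough sketch (the file paths here are just the placeholders from your question):

from gensim.models.keyedvectors import KeyedVectors

# One-off conversion: parse the word2vec-format file once, then save in
# gensim's native format (the big arrays go to separate .npy files).
kv = KeyedVectors.load_word2vec_format('some_path/vectors.bin', binary=True)
kv.save('some_path/vectors.kv')

# In each worker process: memory-map the saved arrays read-only. Repeated
# loads, even from separate processes, share the same pages of RAM.
W2V = KeyedVectors.load('some_path/vectors.kv', mmap='r')
print(W2V['word'][:5])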

(Note that this doesn't even require any shared globals. The OS will notice that each process is requesting the same read-only memory-mapped file, and automatically share those RAM pages. The only duplication will be redundant Python dicts helping each separate .load() know indexes into the shared-array.)

There are some extra wrinkles to consider when doing similarity operations, because the model will want to unit-norm the vectors into another full-size array (once per process) - see this older answer for more details on how to work around that:

How to speed up Gensim Word2vec model load time?

(Note that syn0 and syn0_norm have been renamed vectors and vectors_norm in more recent gensim versions, but the old names may still work, with deprecation warnings, for a while.)
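
Roughly, that workaround looks like this - a sketch assuming a gensim-3.x-era API where init_sims() and vectors_norm exist, and with placeholder paths; adjust the names for your version:

from gensim.models.keyedvectors import KeyedVectors

# One-off: load, unit-norm the vectors in place, then save natively, so the
# saved arrays are already normed.
kv = KeyedVectors.load_word2vec_format('some_path/vectors.bin', binary=True)
kv.init_sims(replace=True)   # overwrite the raw vectors with unit-normed ones
kv.save('some_path/vectors-normed.kv')

# In each worker: memory-map, then tell gensim the vectors are already normed
# so most_similar() doesn't allocate a fresh full-size array per process.
W2V = KeyedVectors.load('some_path/vectors-normed.kv', mmap='r')
W2V.vectors_norm = W2V.vectors
print(W2V.most_similar('word', topn=5))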

gojomo
    Thanks for your reply. I tested out your approach and got it working. You're right - it doesn't need any shared globals, or Redis Queue for that matter. Much appreciated! – lgc Nov 12 '19 at 08:34