
I'm using a cluster to generate word2vec models with gensim from sentences taken from medical journal articles, which are stored in JSON files, and I'm running into excessive memory usage.

The task is to keep a cumulative list of all sentences up to a particular year, and then generate a word2vec model for that year. Then, add the sentences for the next year to the cumulative list and generate and save another model for that year based on all the sentences.

The I/O on this particular cluster is slow enough, and the data large enough (reading roughly two thirds of it into memory takes about 3 days), that streaming each JSON file from disk for every year's model would have taken forever, so my solution was to load all 90 GB of JSON into memory in a Python list. I have permission to use up to 256 GB of memory for this, but could get more if necessary.
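
For context, the streaming alternative I ruled out would look roughly like this re-iterable corpus class (gensim accepts any object whose `__iter__` yields sentences, but every training pass would re-read all of the JSON from disk, which is exactly what is too slow here; the 'sentences' key matches my files, everything else is just a sketch):

import os
import ujson as json

class StreamedSentences(object):
    """Re-iterable corpus: word2vec makes several passes over the sentences
    (a vocabulary scan plus the training epochs), so the JSON files get
    re-read from disk on every pass."""
    def __init__(self, json_dir, json_files):
        self.json_dir = json_dir
        self.json_files = json_files

    def __iter__(self):
        for json_file in self.json_files:
            with open(os.path.join(self.json_dir, json_file), 'rb') as f:
                for sentence in json.load(f)['sentences']:
                    yield sentence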

The trouble is that I'm running out of memory. I've read other posts suggesting that Python's free lists don't return memory to the OS, and I suspect that may be part of the problem, but I'm not sure.

Thinking that the free list might be the problem, and that a numpy array might handle a large number of elements more efficiently, I switched from a cumulative list of sentences to a cumulative numpy array of sentences (gensim requires that each sentence be a list of words/strings). But running this on a small subset of the sentences used slightly more memory, so I'm unsure how to proceed.
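
To illustrate why I suspect the numpy switch can't help: with sentences of different lengths, numpy falls back to an object-dtype array that only stores pointers to the same Python list and string objects, so the bulk of the data isn't stored any more compactly. A tiny, purely illustrative check (toy data, not my real sentences):

import sys
import numpy as np

# Toy ragged sentences (lists of word strings), purely illustrative
sentences = [["aspirin", "reduces", "risk"], ["myocardial", "infarction"]] * 1000

as_array = np.array(sentences, dtype=object)  # 1-D array of pointers to the same list objects
print(as_array.dtype)  # object
# Both numbers below are only the pointer tables; neither includes the nested
# lists and strings, which are shared and unchanged either way.
print(sys.getsizeof(sentences), as_array.nbytes)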

If anyone has experience with this, I would be very grateful for your help. I'd also appreciate suggestions for anything else worth changing. The full code is below:

import ujson as json
import os
import sys
import logging
from gensim.models import word2vec
import numpy as np

PARAMETERS = {
    'numfeatures': 250,
    'minwordcount': 10,
    'context': 7,
    'downsampling': 0.0001,
    'workers': 32
}

logger = logging.getLogger()
handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s %(levelname)-8s %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def generate_and_save_model(cumulative_sentences, models_dir, year):
    """
    Generates and saves the word2vec model for the given year
    :param cumulative_sentences: The list of all sentences up to the current year
    :param models_dir: The directory to save the models to
    :param year: The current year of interest
    :return: Nothing, only saves the model to disk
    """
    cumulative_model = word2vec.Word2Vec(
        sentences=cumulative_sentences,
        workers=PARAMETERS['workers'],
        size=PARAMETERS['numfeatures'],
        min_count=PARAMETERS['minwordcount'],
        window=PARAMETERS['context'],
        sample=PARAMETERS['downsampling']
    )
    cumulative_model.init_sims(replace=True)
    cumulative_model.save(models_dir + 'medline_abstract_word2vec_' + year)


def save_year_models(json_list, json_dir, models_dir, min_year, max_year):
    """
    Goes year by year through each json of sentences, saving a cumulative word2vec
    model for each year
    :param json_list: The list of json year_sentences file names
    :param json_dir: The directory holding the sentences json files
    :param models_dir: The directory to serialize the models to
    :param min_year: The minimum value of a year to generate a model for
    :param max_year: The maximum value of a year to generate a model for
    """

    cumulative_sentences = np.array([])

    for json_file in json_list:
        # The four-digit year sits at a fixed offset in the file name
        year = json_file[16:20]

        # If this year is greater than the maximum, we're done creating models
        if int(year) > max_year:
            break

        with open(json_dir + json_file, 'rb') as current_year_file:
            cumulative_sentences = np.concatenate(
                (np.array(json.load(current_year_file)['sentences']),
                 cumulative_sentences)
            )

        logger.info('COMPLETE: ' + year + ' sentences loaded')
        logger.info('Cumulative length: ' + str(len(cumulative_sentences)) + ' sentences loaded')
        sys.stdout.flush()

        # If this year is less than our minimum, add its sentences to the list and continue
        if int(year) < min_year:
            continue

        generate_and_save_model(cumulative_sentences, models_dir, year)

        logger.info('COMPLETE: ' + year + ' model saved')
        sys.stdout.flush()


def main():
    json_dir = '/projects/chemotext/sentences_by_year/'
    models_dir = '/projects/chemotext/medline_year_models/'

    # By default, generate models for all years we have sentences for
    minimum_year = 0
    maximum_year = 9999

    # If one command line argument is used
    if len(sys.argv) == 2:
        # Generate the model for only that year
        minimum_year = int(sys.argv[1])
        maximum_year = int(sys.argv[1])

    # If two CL arguments are used
    if len(sys.argv) == 3:
        # Generate all models between the two year arguments, inclusive
        minimum_year = int(sys.argv[1])
        maximum_year = int(sys.argv[2])

    # Sorting the list of files so that earlier years are first in the list
    json_list = sorted(os.listdir(json_dir))

    save_year_models(json_list, json_dir, models_dir, minimum_year, maximum_year)

if __name__ == '__main__':
    main()
swf
  • It might also have to do with Unicode shenanigans. Could you give the encoding of your JSON files, Python version in use on the cluster and the number given by [`sys.maxunicode`](http://stackoverflow.com/questions/1446347)? – user6758673 Sep 03 '16 at 09:36
  • Thanks for your response - I had not considered that encoding could be an issue. Entering `file -i thefile.json` tells me that the files are us-ascii encoded. The system runs Python 2.6.6. `sys.maxunicode` returns 1114111. – swf Sep 04 '16 at 15:29
  • Okay that means your Python unicode strings take 4 bytes per character. While in the JSON you probably have mostly characters with 1 byte per character (depends on the share of Unicode chars, Unicode chars are stored as an escape sequence `\uXXXX`). I think this at least partly explains the excessive memory usage. – user6758673 Sep 04 '16 at 21:30
  • That is interesting - what would you do to resolve that issue? – swf Sep 06 '16 at 17:29
  • I posted some ideas below. But it's still going to be tough to keep everything (including the word2vec model) under 256GB probably. It would be nice if the cluster allowed faster I/O. – user6758673 Sep 07 '16 at 16:58

1 Answer


I think you should be able to reduce the memory footprint of the corpus significantly by only explicitly storing the first occurrence of every word. All occurrences after that only need to store a reference to the first. This way you don't spend memory on duplicate strings, at the cost of some overhead. In code it could look something like this:

class WordCache(object):
    def __init__(self):
        # Maps each word to the first string object seen for it
        self.cache = {}

    def filter(self, words):
        for i, word in enumerate(words):
            try:
                # Replace this occurrence with the canonical first-seen object
                words[i] = self.cache[word]
            except KeyError:
                # First time we see this word: remember this object as the canonical one
                self.cache[word] = word
        return words

cache = WordCache()
...
for sentence in json.load(current_year_file)['sentences']:
    cumulative_sentences.append(cache.filter(sentence))

Another thing you might try is moving to Python 3.3 or above. It has a more memory-efficient representation of Unicode strings; see PEP 393.
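
As a rough sanity check (exact numbers depend on the interpreter build), you can compare the per-character cost yourself. Under PEP 393 an ASCII-only string costs roughly one byte per character plus a fixed header, versus four bytes per character on a wide (UCS-4) Python 2 build like the one on your cluster:

import sys

word = "myocardial"  # ASCII-only text, like the us-ascii JSON files
# On Python 3.3+ this is a small fixed header plus ~1 byte per character;
# on a wide Python 2 build, u"myocardial" would cost 4 bytes per character instead.
print(sys.getsizeof(word))
print(sys.getsizeof(word * 10) - sys.getsizeof(word))  # ~1 byte per extra character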

user6758673