
I have a fairly large list of lists representing the tokens in the Sogou text classification data set. I can process the entire training set of 450,000 documents with 12 GB of RAM to spare, but when I call numpy.save() on the list of lists the memory usage seems to double and I run out of memory.

Why is this? Does numpy.save() convert the list to an array before saving but keep the original list around, thus using more memory?

Is there an alternative way to save this list of lists, e.g. pickling it directly? I believe numpy.save uses the pickle protocol, judging from the allow_pickle argument: https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
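For reference, a minimal sketch of the direct-pickle alternative I have in mind (the file name here is just a placeholder):

import cPickle as pickle   # Python 2; on Python 3 this would just be `import pickle`

with open("gen_docs.pkl", "wb") as fh:
    pickle.dump(gen_docs, fh, protocol=pickle.HIGHEST_PROTOCOL)

# and to load it back:
# with open("gen_docs.pkl", "rb") as fh:
#     gen_docs = pickle.load(fh)

The preprocessing code that builds the list of lists: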

print "Collecting Raw Documents, tokenize, and remove stop words"
df = pd.read_pickle(path + dataSetName + "Train")
frequency = defaultdict(int)

gen_docs = []
totalArts = len(df)
for artNum in range(totalArts):
    if artNum % 2500 == 0:
        print "Gen Docs Creation on " + str(artNum) + " of " + str(totalArts)
    bodyText = df.loc[artNum,"fullContent"]
    bodyText = re.sub('<[^<]+?>', '', str(bodyText))
    bodyText = re.sub(pun, " ", str(bodyText))
    tmpDoc = []
    for w in word_tokenize(bodyText):
        w = w.lower().decode("utf-8", errors="ignore")
        #if w not in STOPWORDS and len(w) > 1:
        if len(w) > 1:
            #w = wordnet_lemmatizer.lemmatize(w)
            w = re.sub(num, "number", w)
            tmpDoc.append(w)
            frequency[w] += 1
    gen_docs.append(tmpDoc)
print len(gen_docs)

del df
print "Saving unfiltered gen"
dataSetName = path + dataSetName
np.save("%s_lemmaWords_noStop_subbedNums.npy" % dataSetName, gen_docs)
Kevinj22

1 Answer


np.save first tries to convert its input into an array. After all, it is designed to save numpy arrays.

If the resulting array is multidimensional with a numeric or string dtype, it saves some basic shape and dtype information plus a memory copy of the array's data buffer.

But if the array contains other objects (i.e. object dtype), then it pickles those objects and saves the resulting string(s).
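A small illustration of that behaviour (a hypothetical two-element object array; the file name is a placeholder):

import numpy as np

arr = np.empty(2, dtype=object)        # a 1d array whose elements are Python objects
arr[0] = ["some", "tokens"]
arr[1] = ["more", "tokens", "here"]

np.save("obj_demo.npy", arr)           # the list objects are pickled into the .npy file
back = np.load("obj_demo.npy", allow_pickle=True)   # recent numpy needs allow_pickle=True to load it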

I would try

arr = np.array(gen_docs)

Does that produce a memory error?

If not, what is its shape and dtype?

If the tmpDoc sublists vary in length, arr will be a 1d array with object dtype, and those objects will be the tmpDoc lists.

Only if all the tmpDoc have the same length will it produce a 2d array. Even then the dtype will depend on the elements, whether numbers, strings, or other objects.
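For example, with a hypothetical pair of token lists:

import numpy as np

ragged = np.array([["a", "b"], ["c", "d", "e"]], dtype=object)
# ragged.shape == (2,)   ragged.dtype == object   -> elements are the original lists

rect = np.array([["a", "b", "c"], ["d", "e", "f"]])
# rect.shape == (2, 3)   -> a true 2d array with a fixed-width string dtype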

I might add that such an object array is saved with the pickle protocol.

hpaulj
  • I already tried: gen_docs = np.array(gen_docs) and then saving it, same issue. The tmpDoc sublists do vary in length. I'll update this with a shape and dtype after the code runs again. – Kevinj22 Dec 17 '17 at 18:38
  • If you can't turn the whole list into an array, it might still be useful to convert part of it with: `np.array(gen_docs[:100])`. The same questions about shape and dtype apply. – hpaulj Dec 17 '17 at 19:09
  • Running it with 2500 documents just to check the shape and dtype, this is what I got: (2500,) object – Kevinj22 Dec 17 '17 at 19:20
  • So it's a 1d array of those `tmpDoc` lists. There's no real benefit to trying to save it with `numpy`. – hpaulj Dec 17 '17 at 19:32
  • @hpaulj, I tried everything to save a numpy array (1D float): mgzip, pickle, gz2, plain file, numpy.save*, blosc+pickle. In every case without success; the python process was killed because of lack of memory (even though I have 276 GB of swap on an SSD partition). I couldn't manage to save slices, though. – Tedo Vrbanec Sep 08 '22 at 18:35
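Since the comments conclude there is little benefit to routing a ragged list of lists through numpy, here is a sketch of one alternative (the file name and JSON-lines format are only illustrative): stream each tokenized document out as its own line, so no second full copy of the data is ever built in memory:

import json

with open("gen_docs.jsonl", "w") as fh:       # file name is a placeholder
    for doc in gen_docs:                      # gen_docs as built in the question's code
        fh.write(json.dumps(doc) + "\n")

# reading back, one document at a time:
# with open("gen_docs.jsonl") as fh:
#     gen_docs = [json.loads(line) for line in fh]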