I have a case where I need to write a very large 2D array to a file (pkl, npy, npz, ...). My logic is to get the array part by part and save it to the file sequentially. I also want to read the same array back from this file sequentially. Since the array is too big, I cannot do any of this in one go. So my question is: how can I achieve this? Is there a built-in or external package that can help me do this? The environment I'm using is Python. This is the part of the code that causes a MemoryError.

def generate_arrays():
    model=loadGloveModel('glove.6B.100d.txt')
    clf=pickle.load(open('cluster.pkl','rb'))
    tags=pickle.load(open('tags.pkl','rb'))
    cursor=db.cursor()
    sql="SELECT * FROM tag_data"
    try:
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        print("Error", e)
        db.rollback()
    ingre=[]
    keyw=[]
    for i in cursor.fetchall():
        tag=np.zeros(len(tags))
        ing=np.zeros(len(set(clf.labels_)))
        ii=word_tokenize(i[1])
        tt=word_tokenize(i[2])
        for j in ii:
            try:
                vec=model[j]
            except KeyError:  # word not in the GloVe vocabulary
                continue
            pos=clf.predict([vec])
            ing[pos] +=1
        for j in tt:
            if j in tags:
                tag[tags.index(j)] +=1
        ingre.append(ing)
        keyw.append(tag)
    return [ingre,keyw]

arr = generate_arrays()
pickle.dump(arr,open('input.pkl','wb')) 

I think the problem is due to the low RAM of the machine. Can we open a file stream and write the arrays in batches? Similarly, can I read the arrays back in batches of n rows? Any help would be appreciated.

Hari Krishnan

1 Answer

The best way to achieve this is to use a generator. Instead of returning the entire array at the end of generate_arrays(), use the yield keyword (see Generators). Each time you iterate the generator, it "returns" whatever you yield and then suspends, keeping its state in memory, so only one batch of rows is materialized at a time.
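As a minimal illustration of how yield keeps its state between calls (the function and names here are just for the example, not part of the original code):

    def batches(items, size):
        """Yield successive lists of at most `size` items."""
        batch = []
        for item in items:
            batch.append(item)
            if len(batch) == size:
                yield batch   # hand one batch to the caller, then resume here
                batch = []
        if batch:             # leftover items when the total isn't a multiple of size
            yield batch

    for chunk in batches(range(7), 3):
        print(chunk)          # [0, 1, 2] then [3, 4, 5] then [6]
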

# size is how many rows you want to take from cursor.fetchall() every pass
def generate_arrays(size):
    ... # unchanged
    ingre=[]
    keyw=[]
    # enumerate from 1 so the first row never triggers the modulo test
    for count, row in enumerate(cursor.fetchall(), 1):
        tag=np.zeros(len(tags))
        ing=np.zeros(len(set(clf.labels_)))
        ii=word_tokenize(row[1])
        tt=word_tokenize(row[2])
        for j in ii:
            try:
                vec=model[j]
            except KeyError:  # word not in the GloVe vocabulary
                continue
            pos=clf.predict([vec])
            ing[pos] += 1
        for j in tt:
            if j in tags:
                tag[tags.index(j)] += 1
        ingre.append(ing)
        keyw.append(tag)
        if count % size == 0: # yield every `size` rows
            yield ingre, keyw
            # reset both lists (separately!) so the next yield carries only the new data
            ingre = []
            keyw = []
    # EDIT: don't forget to yield the rest if the total is not a multiple of size
    if ingre:
        yield ingre, keyw

gen = generate_arrays(32) # will take 32 rows of cursor.fetchall() per batch
with open('input.pkl', 'ab') as f: # 'ab': append in *binary* mode, pickle data is binary
    for arr in gen:
        pickle.dump(arr, f)

EDIT

As asked in the comments, here is a possible read function:

# Each pickle.dump() call above wrote one batch of `size` rows, so each
# pickle.load() reads one batch back; the batch size is fixed by the `size`
# used when writing, and we keep loading until the end of the file.
def load_gen(file_path):
    with open(file_path, 'rb') as f:  # 'rb': pickle files are binary
        while True:
            try:
                yield pickle.load(f)  # one (ingre, keyw) batch per call
            except EOFError:
                break

for ingre, keyw in load_gen('input.pkl'):
    ...  # process one batch at a time
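A quick self-contained check of the dump-then-load pattern (the file name and the read_batches helper are just for this demo):

    import pickle

    # write several batches sequentially, one pickle.dump() per batch
    with open('demo.pkl', 'wb') as f:
        for batch in ([1, 2, 3], [4, 5], [6]):
            pickle.dump(batch, f)

    # read them back one at a time without loading the whole file at once
    def read_batches(file_path):
        with open(file_path, 'rb') as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:  # raised by pickle.load at end of file
                    break

    result = list(read_batches('demo.pkl'))
    print(result)  # [[1, 2, 3], [4, 5], [6]]
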

ADDITIONAL NOTE: BE CAREFUL

I made a mistake when resetting the arrays. It should not be

ingre = keyw = []

but

ingre = []
keyw = []

because ingre = keyw = [] binds both names to the same list object, so keyw.append(X) appends to ingre too.
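A short demonstration of the aliasing (variable names here are arbitrary):

    a = b = []      # one list object, two names
    b.append(1)
    print(a)        # [1] -- the append through b is visible through a
    print(a is b)   # True

    c = []
    d = []          # two independent lists
    d.append(1)
    print(c)        # []
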

AlEmerich
  • This is what I've been looking for!! Can you please tell me how to load n arrays from input.pkl, since I can't load it in one go – Hari Krishnan Nov 29 '17 at 10:11
  • @HariKrishnan edited with what you asked. You will use that function exactly like the previous one. Hope it suits your need! :) – AlEmerich Nov 29 '17 at 10:25