I have a case where I need to write a very large 2D array to a file (pkl, npy, npz, ...). My logic is to get the array part by part and save it to the file sequentially. I also want to read the same array back from this file sequentially. Since the array is too big, I cannot do any of this in one go. So my question is: how can I achieve this? Is there a built-in or external package that can help me do this? The environment I'm using is Python. This is the part of the code that causes a MemoryError.

def generate_arrays():
    model=loadGloveModel('glove.6B.100d.txt')
    clf=pickle.load(open('cluster.pkl','rb'))
    tags=pickle.load(open('tags.pkl','rb'))
    cursor=db.cursor()
    sql="SELECT * FROM tag_data"
    try:
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        print("Error", e)
        db.rollback()
    ingre=[]
    keyw=[]
    for i in cursor.fetchall():
        tag=np.zeros(len(tags))
        ing=np.zeros(len(set(clf.labels_)))
        ii=word_tokenize(i[1])
        tt=word_tokenize(i[2])
        for j in ii:
            try:
                vec=model[j]
            except KeyError:  # word not in the GloVe vocabulary
                continue
            pos=clf.predict([vec])
            ing[pos] +=1
        for j in tt:
            if j in tags:
                tag[tags.index(j)] +=1
        ingre.append(ing)
        keyw.append(tag)
    return [ingre,keyw]

arr = generate_arrays()
pickle.dump(arr,open('input.pkl','wb')) 

I think the problem is due to the low RAM of the machine. Can we open a file stream and write the arrays in batches? Similarly, can I read the arrays back in batches of n rows? Any help would be appreciated.

Hari Krishnan

1 Answer

The best way to achieve this is to use a generator. Instead of returning the entire array at the end of generate_arrays(), use the yield keyword (see Generators). Each time you iterate the generator, it "returns" whatever you yield and then suspends, keeping its state in memory, so only one batch of rows is materialized at a time.
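As a minimal illustration of how yield keeps its state between calls (the function and names here are just for the example, not part of the original code):

    def batches(items, size):
        """Yield successive lists of at most `size` items."""
        batch = []
        for item in items:
            batch.append(item)
            if len(batch) == size:
                yield batch   # hand one batch to the caller, then resume here
                batch = []
        if batch:             # leftover items when the total isn't a multiple of size
            yield batch

    for chunk in batches(range(7), 3):
        print(chunk)          # [0, 1, 2] then [3, 4, 5] then [6]
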

# size is how many rows you want to take from cursor.fetchall() every pass
def generate_arrays(size):
    ... # unchanged
    ingre=[]
    keyw=[]
    # enumerate from 1 so the first row never triggers the modulo test
    for count, row in enumerate(cursor.fetchall(), 1):
        tag=np.zeros(len(tags))
        ing=np.zeros(len(set(clf.labels_)))
        ii=word_tokenize(row[1])
        tt=word_tokenize(row[2])
        for j in ii:
            try:
                vec=model[j]
            except KeyError:  # word not in the GloVe vocabulary
                continue
            pos=clf.predict([vec])
            ing[pos] += 1
        for j in tt:
            if j in tags:
                tag[tags.index(j)] += 1
        ingre.append(ing)
        keyw.append(tag)
        if count % size == 0: # yield every `size` rows
            yield ingre, keyw
            # reset both lists (separately!) so the next yield carries only the new data
            ingre = []
            keyw = []
    # EDIT: don't forget to yield the rest if the total is not a multiple of size
    if ingre:
        yield ingre, keyw

gen = generate_arrays(32) # will take 32 rows of cursor.fetchall() per batch
with open('input.pkl', 'ab') as f: # 'ab': append in *binary* mode, pickle data is binary
    for arr in gen:
        pickle.dump(arr, f)

EDIT

As asked in the comments, here is a possible read function:

# Each pickle.dump() call above wrote one batch of `size` rows, so each
# pickle.load() reads one batch back; the batch size is fixed by the `size`
# used when writing, and we keep loading until the end of the file.
def load_gen(file_path):
    with open(file_path, 'rb') as f:  # 'rb': pickle files are binary
        while True:
            try:
                yield pickle.load(f)  # one (ingre, keyw) batch per call
            except EOFError:
                break

for ingre, keyw in load_gen('input.pkl'):
    ...  # process one batch at a time
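A quick self-contained check of the dump-then-load pattern (the file name and the read_batches helper are just for this demo):

    import pickle

    # write several batches sequentially, one pickle.dump() per batch
    with open('demo.pkl', 'wb') as f:
        for batch in ([1, 2, 3], [4, 5], [6]):
            pickle.dump(batch, f)

    # read them back one at a time without loading the whole file at once
    def read_batches(file_path):
        with open(file_path, 'rb') as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:  # raised by pickle.load at end of file
                    break

    result = list(read_batches('demo.pkl'))
    print(result)  # [[1, 2, 3], [4, 5], [6]]
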

ADDITIONAL NOTE: BE CAREFUL

I made a mistake when resetting the arrays. It should not be

ingre = keyw = []

but

ingre = []
keyw = []

because ingre = keyw = [] binds both names to the same list object, so keyw.append(X) appends to ingre too.
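A short demonstration of the aliasing (variable names here are arbitrary):

    a = b = []      # one list object, two names
    b.append(1)
    print(a)        # [1] -- the append through b is visible through a
    print(a is b)   # True

    c = []
    d = []          # two independent lists
    d.append(1)
    print(c)        # []
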

AlEmerich
  • This is what I've been looking for!! Can you please tell me how to load n arrays from input.pkl, since I can't load it in one go – Hari Krishnan Nov 29 '17 at 10:11
  • @HariKrishnan edited with what you asked. You will use that function exactly like the previous one. Hope it suits your need! :) – AlEmerich Nov 29 '17 at 10:25