I have a case where I need to write a very large 2D array to a file (pkl, npy, npz, ...). My plan is to generate the array part by part and append each part to the file sequentially, and later read the same array back from that file in batches. Since the array is too big to fit in memory, I cannot do either step in one go. So my question is: how can I achieve this? Is there a built-in or external package that can help me do this? I'm working in Python. This is the part of the code that causes a MemoryError.
import pickle

import numpy as np
from nltk.tokenize import word_tokenize

# loadGloveModel and the db connection are defined elsewhere in my script

def generate_arrays():
    # load the GloVe vectors, the clustering model and the tag list up front
    model = loadGloveModel('glove.6B.100d.txt')
    clf = pickle.load(open('cluster.pkl', 'rb'))
    tags = pickle.load(open('tags.pkl', 'rb'))

    cursor = db.cursor()
    sql = "SELECT * FROM tag_data"
    try:
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        print "Error", e
        db.rollback()

    ingre = []
    keyw = []
    # build one row of each output array per database row
    for i in cursor.fetchall():
        tag = np.zeros(len(tags))
        ing = np.zeros(len(set(clf.labels_)))
        ii = word_tokenize(i[1])
        tt = word_tokenize(i[2])
        for j in ii:
            try:
                vec = model[j]
            except KeyError:
                # skip words that are not in the GloVe vocabulary
                continue
            pos = clf.predict([vec])
            ing[pos] += 1
        for j in tt:
            if j in tags:
                tag[tags.index(j)] += 1
        ingre.append(ing)
        keyw.append(tag)
    return [ingre, keyw]

arr = generate_arrays()
with open('input.pkl', 'wb') as f:
    pickle.dump(arr, f)
I think the problem is the machine's limited RAM. Can I open a file stream and write the arrays in batches, and similarly read them back in batches of n rows (a rough sketch of what I mean is below)? Any help would be appreciated.
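To make the question concrete, this is the kind of pattern I have in mind. It is only a sketch, not working code from my project; write_batches and read_batches are placeholder names, and the batches would come from splitting the cursor results into chunks instead of collecting everything in ingre/keyw:

import pickle

import numpy as np

def write_batches(filename, batches):
    # dump each batch into the same open file, one after the other
    with open(filename, 'wb') as f:
        for batch in batches:
            pickle.dump(np.asarray(batch), f)

def read_batches(filename):
    # yield the batches back in the same order until end of file
    with open(filename, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

Is repeatedly calling pickle.dump/pickle.load on one file handle like this a sane approach, or is there a more standard way to do it (for example numpy memory mapping via np.load(..., mmap_mode='r'), or HDF5)?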