
I am trying to create 3 tensors for my language translation LSTM network.

```python
import numpy as np

Num_samples = 50000
Time_step = 100
Vocabulary = 5000

shape = (Num_samples, Time_step, Vocabulary)
encoder_input_data = np.zeros(shape, dtype='float32')
decoder_input_data = np.zeros(shape, dtype='float32')
decoder_target_data = np.zeros(shape, dtype='float32')
```

Obviously, my machine doesn't have enough memory to do so. Since the data is represented as one-hot vectors, it seemed that the `csc_matrix()` function from `scipy.sparse` would be the solution, as suggested in this thread and this thread.

But after trying `csc_matrix()` and `csr_matrix()`, it seems they only support 2D arrays.
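
For illustration, a minimal sketch of one workaround within that 2D limitation: keep one 2D `csr_matrix` per sample instead of a single 3D array (the helper function and the random data below are illustrative only, not part of my original code):

```python
import numpy as np
from scipy.sparse import csr_matrix

Time_step = 100
Vocabulary = 5000

def one_hot_sample(token_ids):
    """Encode one sequence of vocabulary indices as a (Time_step, Vocabulary) CSR matrix."""
    rows = np.arange(len(token_ids))                # one non-zero per time step
    data = np.ones(len(token_ids), dtype='float32')
    return csr_matrix((data, (rows, token_ids)), shape=(Time_step, Vocabulary))

# e.g. a list of 50000 such matrices instead of one dense 3D array
sample = one_hot_sample(np.random.randint(0, Vocabulary, size=Time_step))
dense_sample = sample.toarray()                     # densify only when a dense slice is needed
```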

Old threads from 6 years ago did discuss this issue, but they are not machine-learning oriented.

My question is: is there any Python lib/tool that can help me create sparse 3D arrays, so that I can store one-hot vectors for machine learning purposes later?
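
As a sketch of what such a tool might look like, the following assumes the third-party pydata `sparse` package (`pip install sparse`), which provides N-dimensional sparse arrays in COO format; the package choice and the integer-encoded token IDs are assumptions for illustration, not part of my original setup:

```python
import numpy as np
import sparse  # pydata/sparse: N-dimensional sparse arrays in COO format

Num_samples, Time_step, Vocabulary = 50000, 100, 5000

# Hypothetical integer-encoded corpus: token_ids[i, t] is the vocabulary index
# of token t in sample i (random here, just for illustration).
token_ids = np.random.randint(0, Vocabulary, size=(Num_samples, Time_step))

# Coordinates of every non-zero entry of the one-hot tensor:
# (sample index, time step, vocabulary index) -> 1.0
samples = np.repeat(np.arange(Num_samples), Time_step)
steps = np.tile(np.arange(Time_step), Num_samples)
coords = np.stack([samples, steps, token_ids.ravel()])
data = np.ones(Num_samples * Time_step, dtype='float32')

encoder_input_sparse = sparse.COO(coords, data, shape=(Num_samples, Time_step, Vocabulary))

# Densify only small slices when feeding the network, e.g. one batch at a time:
batch = encoder_input_sparse[:256].todense()
```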

  • `scipy.sparse` matrices are used in learning, such as in the `sklearn` package. But no, they have not been expanded to 3D. – hpaulj May 21 '18 at 04:00
  • If that is the case, is there a way to train my network with the `scipy.sparse` matrices? All of my machine learning experience so far involves creating, at the very beginning, a 3D array that contains the batch, the time_step, and the length of the one-hot vector. I am open to new ways to deal with the training data. – Raven Cheuk May 21 '18 at 04:24
  • I suddenly have another thought in mind (which is not directly related to this problem). How strong is the machine that Google uses to train their translation deep network? Obviously, 50000 training examples are not a huge data size at all, yet it already takes 23Gb of memory for each array. I would expect Google to use an even larger set of training examples, say 5 million. By the same calculation, they would need 2000Gb of RAM. Does such a machine even exist? – Raven Cheuk May 21 '18 at 04:41
  • @RavenCheuk the sorts of models trained by giant companies are usually trained on large, distributed computing clusters, not a single machine. And probably significantly more training data than 5 million examples – juanpa.arrivillaga May 21 '18 at 05:27
  • But yes, you can nowadays buy a machine with 2000GB of ram anyway. – juanpa.arrivillaga May 21 '18 at 05:27
  • @RavenCheuk I'm not that familiar with LSTM but I doubt you're using batches that big to train a network (on a GPU?). Can't you use a generator and build smaller batches on the fly? – filippo May 21 '18 at 10:50
  • @filippo Maybe I mislabeled that dimension; it should be the total number of samples instead of the batch. The batch size that I feed to the network is only 256 during training, i.e. each time 256 training samples are taken from the 50000 samples and fed to the network. – Raven Cheuk May 21 '18 at 11:01
  • You can use [memory mapping](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) to work with very large disk-backed arrays, chunk by chunk. See [this example](https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/). – jdehesa May 21 '18 at 11:31
  • @RavenCheuk so at any particular time you just need `256` samples in memory. See @jdehesa suggestion if you're dealing with large arrays or just load the samples for a single batch at each training epoch or preload several batches with a queue – filippo May 21 '18 at 11:49
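
For reference, a minimal sketch of the batch-on-the-fly idea suggested in the comments above: keep only integer-encoded sequences in memory and build dense one-hot batches of 256 samples as the network needs them (the generator name and the random data are illustrative, not from the original question):

```python
import numpy as np

Num_samples, Time_step, Vocabulary, Batch_size = 50000, 100, 5000, 256

# Hypothetical integer-encoded corpus: each entry is a vocabulary index.
token_ids = np.random.randint(0, Vocabulary, size=(Num_samples, Time_step))

def one_hot_batches(ids, batch_size=Batch_size):
    """Yield dense (batch, Time_step, Vocabulary) one-hot arrays, one batch at a time."""
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        batch = np.zeros((len(chunk), Time_step, Vocabulary), dtype='float32')
        # Set batch[b, t, chunk[b, t]] = 1.0 for every sample b and time step t
        batch[np.arange(len(chunk))[:, None], np.arange(Time_step), chunk] = 1.0
        yield batch

# e.g. feed each ~0.5 GB batch to the model instead of allocating the full 3D tensor
for batch in one_hot_batches(token_ids):
    pass  # train on `batch` here
```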

0 Answers