
I am using TensorFlow to build a CNN-based text classifier. Some of my datasets are large and some are small.

I use `feed_dict` to feed the network by sampling data from system memory (not GPU memory). The network is trained batch by batch, with a fixed batch size of 1024 for every dataset.
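For context, the training loop follows roughly this pattern (a minimal TF 1.x-style sketch; the model, the dimensions, and the `sample_batch` helper are illustrative stand-ins, not the actual code):

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 1024        # fixed for every dataset, as described above
SEQ_LEN = 100            # illustrative input width
NUM_CLASSES = 2          # illustrative
NUM_STEPS = 10           # illustrative

# Dummy dataset held entirely in system memory as NumPy arrays.
train_x = np.random.rand(50000, SEQ_LEN).astype(np.float32)
train_y = np.random.randint(0, NUM_CLASSES, 50000)

# Placeholders: the graph only ever sees one batch at a time.
x = tf.placeholder(tf.float32, [None, SEQ_LEN], name="x")
y = tf.placeholder(tf.int64, [None], name="y")

logits = tf.layers.dense(x, NUM_CLASSES)  # stand-in for the CNN layers
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

def sample_batch(data_x, data_y, size):
    """Draw a random batch from arrays kept in system memory."""
    idx = np.random.randint(0, len(data_x), size)
    return data_x[idx], data_y[idx]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(NUM_STEPS):
        batch_x, batch_y = sample_batch(train_x, train_y, BATCH_SIZE)
        # Only this one batch is copied to the device for this run call.
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```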

My question is: since the network is trained batch by batch, and for each batch the code retrieves the data from system memory, shouldn't the code handle a large dataset exactly the same way as a small one?

But I get an out-of-memory (OOM) error with the large datasets, while the small ones work fine. I am pretty sure the system memory is enough to hold all the data, so the OOM problem must be on the TensorFlow side, right?

Did I write my code wrong, or is this about TensorFlow's memory management?

Thanks a lot!

xyd
  • The memory should be released after each `.run` call (except for the variables), so issuing more run calls shouldn't increase your memory usage. – Yaroslav Bulatov Jun 10 '16 at 05:10
  • Yes, that's what I understand. Is there a good way to check memory usage for this? Thanks – xyd Jun 10 '16 at 15:02
  • You could look at memory allocation [messages](https://github.com/tensorflow/tensorflow/commit/ec1403e7dc2b919531e527d36d28659f60621c9e) (need [verbose logging](http://stackoverflow.com/a/36505898/419116)) – Yaroslav Bulatov Jun 10 '16 at 15:30
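Following up on the memory-usage question in the comments above: besides the verbose allocator logging linked there, one option (my addition, not from the thread) is to request full tracing for a single `run` call and inspect the resulting Chrome trace, which includes per-op memory. This assumes the session and tensors (`sess`, `x`, `y`, `train_op`, `batch_x`, `batch_y`) from the sketch earlier in the question:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(train_op,
         feed_dict={x: batch_x, y: batch_y},
         options=run_options,
         run_metadata=run_metadata)

# step_stats carries per-node timing and memory; dump a trace that can be
# opened in chrome://tracing to see where the memory goes.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format(show_memory=True))
```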

1 Answer


I think your batch size of 1024 is way too big. A lot of intermediate matrices get created per batch, especially if you use AdaGrad, Adam and the like, dropout, attention and/or more. Try a smaller value, like 100, as the batch size. That should solve the problem and train just fine.
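As a rough illustration of why the batch size matters (the dimensions below are made up, not taken from the question), activation memory grows linearly with the batch size, so dropping from 1024 to around 100 cuts it by an order of magnitude:

```python
# Back-of-envelope activation memory for one conv layer's output,
# using made-up dimensions purely for illustration.
batch_size = 1024
seq_len = 200        # hypothetical sequence length
num_filters = 512    # hypothetical number of convolution filters
bytes_per_float = 4

activation_bytes = batch_size * seq_len * num_filters * bytes_per_float
print("one layer's activations: %.2f GB" % (activation_bytes / 1024.0 ** 3))
# ~0.39 GB for a single layer; multiply by the number of layers, add the
# gradients and optimizer slots (e.g. Adam keeps two extra copies of every
# variable), and a batch of 1024 can exhaust GPU memory where 100 fits.
```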

Phillip Bock
  • Yes, you are right, a smaller batch size does solve the problem. But my question is: why is a batch size of 1024 OK on the small dataset? Doesn't TensorFlow train the network batch by batch, i.e. load only one batch into GPU memory, compute, then drop it and load the next one? Thanks. – xyd Jun 09 '16 at 23:55
  • How are you defining the vocabulary for your model? Is that dependent on the size of the dataset? – Aaron Jun 10 '16 at 04:13
  • Well, I guess the big batch size already consumes a lot of memory. Add the bigger dataset and you run out of memory. But that's a wild guess only... – Phillip Bock Jun 10 '16 at 09:14
  • @friesel I have a rather large network. My batch size is 5, my image dimensions are 396x396x5 (3D volume). And I'm running out of memory. Are there any workarounds? Can you direct me to any links that outline how I can handle networks that are too large? Keep in mind I'm running on a GTX 1070 with 8GB of RAM lol. – Kendall Weihe Aug 11 '16 at 18:59
  • Batch size of 5 is rather small. Didn't you write in the original post it was 1024? – Phillip Bock Aug 12 '16 at 09:10