6

I recently ran into an issue where my saved model files are much larger than expected. I am using TensorFlow 1.4.

Before, I used

tf.train.string_input_producer() and tf.train.batch()

to load images from a text file. And in the training,

tf.train.start_queue_runners() and tf.train.Coordinator()

were used to provide data to the network. In this case, every time I saved the model using

saver.save(sess, checkpoint_path, global_step=iters)

it only gave me a small file, i.e. a file named model.ckpt-1000.data-00000-of-00001 of about 1.6 MB.

Now, I use

tf.data.Dataset.from_tensor_slices()

to supply images to an input placeholder, and the saved model becomes 290 MB, but I don't know why. I suspect the TensorFlow saver stores the dataset in the model as well. If so, how can I remove it to make the checkpoint smaller, so that only the weights of the network are saved?

This is not network dependent: I tried two different networks and both behaved the same way.

I have googled but unfortunately found nothing related to this issue. (Or maybe this is not an issue at all and I simply don't know how it works?)

Thank you very much for any ideas and help!

Edit

The way I initialise the dataset is as follows:

1. First, generate the numpy.array dataset:

self.train_hr, self.train_lr = cifar10.load_dataset(sess)

The initial dataset is a numpy.array, for example of shape [8000, 32, 32, 3]. I pass sess into this function because, inside it, I call tf.image.resize_images() and use sess.run() to generate the numpy.arrays. The returned self.train_hr and self.train_lr are numpy.arrays of shape [8000, 64, 64, 3].

2. Then I create the dataset:

self.img_hr = tf.placeholder(tf.float32)
self.img_lr = tf.placeholder(tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((self.img_hr, self.img_lr))
dataset = dataset.repeat(conf.num_epoch).shuffle(buffer_size=conf.shuffle_size).batch(conf.batch_size)
self.iterator = dataset.make_initializable_iterator()
self.next_batch = self.iterator.get_next()

3. Then I initialise the network and the dataset, run the training, and save the model:

self.labels = tf.placeholder(tf.float32,
                                     shape=[conf.batch_size, conf.hr_size, conf.hr_size, conf.img_channel])
self.inputs = tf.placeholder(tf.float32,
                                     shape=[conf.batch_size, conf.lr_size, conf.lr_size, conf.img_channel])
self.net = Net(self.labels, self.inputs, mask_type=conf.mask_type,
                       is_linear_only=conf.linear_mapping_only, scope='sr_spc')

sess.run(self.iterator.initializer,
                         feed_dict={self.img_hr: self.train_hr, self.img_lr: self.train_lr})
while True:
    hr_img, lr_img = sess.run(self.next_batch)
    _, loss, summary_str = sess.run([train_op, self.net.loss, summary_op],
                                    feed_dict={self.labels: hr_img, self.inputs: lr_img})
    ...
    ...
    checkpoint_path = os.path.join(conf.model_dir, 'model.ckpt')
    saver.save(sess, checkpoint_path, global_step=iters)

All the sess references are the same session instance.

F Bai

3 Answers

2

I suspect you created a TensorFlow constant (tf.constant) out of your dataset, which would explain why the dataset gets stored with the graph. There is an initializable dataset that lets you feed in the data using feed_dict at runtime. It takes a few extra lines of code to set up, but it is probably what you want to use.

https://www.tensorflow.org/programmers_guide/datasets

Note that constants get created for you automatically in the Python wrapper. The following statements are equivalent:

tf.Variable(42)
tf.Variable(tf.constant(42))
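
For illustration, a minimal sketch of the difference (TF 1.x; the array shape is just an example): passing a numpy array straight to from_tensor_slices bakes it into the graph as constants, while a placeholder-backed initializable dataset keeps the data out of the graph and feeds it at runtime.

import numpy as np
import tensorflow as tf

data = np.zeros([8000, 64, 64, 3], dtype=np.float32)  # example array

# Embeds the whole array in the graph as a tf.constant:
ds_const = tf.data.Dataset.from_tensor_slices(data)

# Keeps the data out of the graph; it is fed in at runtime instead:
data_ph = tf.placeholder(tf.float32, shape=data.shape)
ds_fed = tf.data.Dataset.from_tensor_slices(data_ph)
iterator = ds_fed.make_initializable_iterator()
next_elem = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={data_ph: data})
    first = sess.run(next_elem)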
David Parks
  • Thanks for your answer. I should clarify how I initialised the dataset; I have just updated the question. I used the initializable dataset. The reason is that if I use make_one_shot, it adds the dataset to the graph, and then I get an error saying the graph is larger than 2GB. With the initializable one, I can feed_dict the data to the graph, but the saved model was still big. That confused me. – F Bai Apr 09 '18 at 15:23
  • Ah, well then this is not the answer. Perhaps the Dataset object stores things like indexes to the dataset then. If you have a large dataset with millions of samples I could imagine indexing constants being stored as part of (possibly multiple) Dataset objects. But that's a total guess. – David Parks Apr 09 '18 at 15:32
  • Yes, I am guessing so. Do you know how to check what is saved in the model and the size of each tensor? I tried `print_tensors_in_checkpoint_file` but it cannot show the sizes of the different tensors. – F Bai Apr 09 '18 at 16:06
  • You could loop over the names of the tensor getting them out of an active session and print out their shape. Here's a way to get all the tensors in a graph by name: https://stackoverflow.com/questions/36883949/in-tensorflow-get-the-names-of-all-the-tensors-in-a-graph. There's a tensorflow debugger which is only a few lines of code to set up too, but I guess the loop will be easier. – David Parks Apr 09 '18 at 16:09
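
A minimal sketch of that kind of inspection, working directly on the checkpoint file rather than the live graph (this assumes TF 1.x's tf.train.NewCheckpointReader; the checkpoint path and the float32 size estimate are just assumptions):

import numpy as np
import tensorflow as tf

# List every tensor stored in a checkpoint with its shape and a rough size
# estimate (assuming float32, i.e. 4 bytes per element).
ckpt_path = 'checkpoint/model.ckpt-1000'  # example path
reader = tf.train.NewCheckpointReader(ckpt_path)
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    n_elems = int(np.prod(shape)) if shape else 1
    print('%-60s %-20s %8.2f MB' % (name, shape, n_elems * 4 / 1024.0 / 1024.0))
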
0

TensorFlow does indeed save your dataset. To understand how to solve this, let's first understand why it happens.

How does TensorFlow work and what does it save?

In short, the TensorFlow API lets you build a computation graph via code and then optimize it. Every op/variable/constant you define in the graph works on tensors and is part of that graph. This framework is convenient because TensorFlow just builds a graph, and then the framework decides (or you specify) where to compute the graph in order to get maximum speed out of your hardware, for instance by computing on your GPU.

The GPU is a good illustration of your issue. Sending data from the HDD/RAM/CPU to the GPU is expensive time-wise. Therefore, TensorFlow also allows you to create input producers that pretty much automatically manage the data transferred between all peripheral units, by queuing the data and managing threads. However, I haven't seen much gain from that approach. Note that the inputs produced by datasets are also tensors, specifically constants/variables that are used as input to the network; therefore, they are part of the graph.

When saving a graph, we save several things:

  1. Metadata - which defines the graph and its structure.
  2. Values - of each variable/constant in the graph, in order to load it and reuse the network.

When you use datasets, the values of these non-trainable variables are saved as well, and therefore your checkpoint file is larger.
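
One quick way to see which variables a default tf.train.Saver would pick up in your own graph (a sketch, assuming a TF 1.x graph has already been built):

import tensorflow as tf

# A default tf.train.Saver() saves (roughly) everything in tf.global_variables(),
# not just the trainable weights, so comparing the two collections shows what
# will end up in the checkpoint.
trainable = {v.name for v in tf.trainable_variables()}
for v in tf.global_variables():
    kind = 'trainable' if v.name in trainable else 'non-trainable'
    print('%-60s %-20s %s' % (v.name, v.get_shape(), kind))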

To better understand datasets, see their implementation in the package source files.

TL;DR - How do I fix my problem?

  1. If it does not hurt performance, use a feed dictionary to feed placeholders. Do not use tensors to store your data; this way those variables will not be saved.

  2. Save only the tensors that you would like to load later (weights, biases, etc.). You can use the .eval() method to get their values, save them as JSON or similar, and load them later by reconstructing the graph; a small sketch follows below.
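
A minimal sketch of that second option (the variable names here are just hypothetical placeholders for your real weights):

import json
import tensorflow as tf

# Hypothetical variables standing in for the weights you want to keep.
w = tf.Variable(tf.random_normal([784, 200]), name='w')
b = tf.Variable(tf.zeros([200]), name='b')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # .eval() returns the current value of a variable as a numpy array.
    values = {v.name: v.eval(session=sess).tolist() for v in [w, b]}

with open('weights.json', 'w') as f:
    json.dump(values, f)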

Good luck!

mr_mo
  • Thank you for the answer. However, I used feed_dict and `dataset.make_initializable_iterator()`, which are designed to avoid adding the big dataset to the graph as a node. I have updated the initialisation method in the question post. – F Bai Apr 09 '18 at 15:51
0

I have not solved this issue perfectly (I still don't know where the problem comes from), but I found a workaround that avoids saving a large amount of data.

I defined a saver that is given an explicit list of variables to save. That list contains only the nodes of my network. Here is a small example of my workaround:

import tensorflow as tf

v1 = tf.Variable(tf.random_normal([784, 200], stddev=0.35), name="v1")
v2 = tf.Variable(tf.zeros([200]), name="v2")

# Only v2 is handed to the saver, so v1 is not written to the checkpoint.
saver = tf.train.Saver([v2])
# saver = tf.train.Saver()  # the default saver would save every variable

with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    saver.save(sess, "checkpoint/model_test", global_step=1)

Here [v2] is the variable list. Alternatively, you can use variables_list = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='net') to collect all the variables under a given scope; a sketch of that variant follows below.
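
A minimal sketch of the scope-based variant (assuming the network variables were all created under a variable scope named 'net', which is just an example name):

import tensorflow as tf

# All network variables live under one scope ('net' is just an example name)...
with tf.variable_scope('net'):
    w = tf.get_variable('w', initializer=tf.random_normal([784, 200], stddev=0.35))
    b = tf.get_variable('b', initializer=tf.zeros([200]))

# ...so the saver can be restricted to exactly that scope, ignoring anything
# else (e.g. dataset-related tensors) that lives in the graph.
net_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='net')
saver = tf.train.Saver(net_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "checkpoint/model_scoped", global_step=1)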

F Bai