
As above. I tried pickling it, but I got this error:

maybe_arr = self._cpu_nograd()._numpy() # pylint: disable=protected-access

RuntimeError: Tensorflow type 21 not convertible to numpy dtype.

Anubhav Singh
Alex Deft
  • did you find a solution to this? – evolved Sep 03 '20 at 19:33
  • @evolved I don't think so. I think it is a bad idea anyway. Why would you want to do it? Keep the data saved in numpy format, if anything. – Alex Deft Sep 03 '20 at 22:05
    Thanks for your response. Using numpy format won't work for me I guess, because the data I am dealing with doesn't fit in memory. This is why I started using tf.data.Dataset. – evolved Sep 03 '20 at 22:16
  • @evolved there you go, you just answered yourself. The data is not going to fit in memory, so keep it on disk. A Dataset object is not going to make it fit in memory or anything. It is just a pipeline for grabbing the actual data from disk. I firmly believe that you don't need to save it. – Alex Deft Sep 03 '20 at 22:41
  • Yes I agree, if saving means writing all the data from the stream to disk or memory. However, I want the pipeline to be reproducible, which is why I just want to save the steps by which the tf.data.Dataset pipeline was created - not the data itself. This pipeline object should be smaller than the actual data. But this would require a serializable representation of the code that generated the pipeline, which is currently not implemented I guess. Hope that makes sense. – evolved Sep 03 '20 at 22:49
  • @evolved I see your point. But, reproducible is not the same as loading a saved thing. Reproducible means re-running the code used to generate that pipeline. If there are any elements of randomness in the process, you can control them by setting the seed to a fixed value. – Alex Deft Sep 03 '20 at 22:57
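The seed-fixing approach from the last comment can be sketched as follows: instead of serializing the Dataset object, re-run the same builder code with a fixed seed, which reproduces the pipeline exactly, including the shuffle order. A minimal sketch, assuming TensorFlow 2.x (the `build_pipeline` function is illustrative, not an API):

```python
import tensorflow as tf

def build_pipeline(seed=42):
    # Re-running this function reproduces the exact same pipeline,
    # including the shuffle order, because the seed is fixed.
    ds = tf.data.Dataset.range(10)
    ds = ds.shuffle(buffer_size=10, seed=seed, reshuffle_each_iteration=False)
    ds = ds.map(lambda x: x * 2)
    return ds

# Two independent builds yield an identical element order.
a = [int(x) for x in build_pipeline()]
b = [int(x) for x in build_pipeline()]
assert a == b
```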

1 Answer


tf.data.Dataset is an abstract object whose job is to define the data pipeline, not to hold the data itself. If you want to save intermediate results to speed up your data pipeline, you can use tf.data.Dataset.cache() or tf.data.Dataset.prefetch() (more on them here).
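A minimal sketch of the two methods mentioned above, assuming TensorFlow 2.x (the map here is a stand-in for an expensive transformation; passing a filename to cache() would persist the results to disk instead of memory):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)
# Pretend this map is expensive; cache() stores its results after the
# first pass so later epochs skip recomputation.
ds = ds.map(lambda x: x + 1)
ds = ds.cache()
# prefetch() overlaps preparation of upcoming elements with
# consumption of the current one.
ds = ds.prefetch(tf.data.AUTOTUNE)

print([int(x) for x in ds])  # [1, 2, 3, 4, 5] on every epoch
```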

If you are interested in saving the sequence of operations in your data pipeline, as far as I know there is no such facility, and you need to keep the code that builds the pipeline. I am not aware of any method that can extract the pipeline graph from the Dataset API. If anyone is aware of one, please add it to the answer.
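In the spirit of the answer above, one workaround is to treat the builder code plus its parameters as the serializable representation: pickle a small, plain config describing the pipeline, and rebuild the Dataset from it. All names below (`build_from_config`, the config keys) are hypothetical, not TensorFlow API:

```python
import pickle
import tensorflow as tf

def build_from_config(config):
    # Rebuild the pipeline from a plain, picklable description.
    ds = tf.data.Dataset.range(config["n"])
    ds = ds.batch(config["batch_size"])
    return ds

config = {"n": 6, "batch_size": 2}
blob = pickle.dumps(config)  # the config is tiny and picklable,
restored = build_from_config(pickle.loads(blob))  # unlike the Dataset itself
print([t.numpy().tolist() for t in restored])  # [[0, 1], [2, 3], [4, 5]]
```

The pickled blob is far smaller than the data and survives across runs, which addresses the reproducibility concern from the comments without serializing the Dataset object itself.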

Praveen Kulkarni