I am trying to build a simple model just to figure out how to work with tf.data.Dataset.from_generator. I cannot understand how to set the output_shapes argument. I tried several combinations, including not specifying it at all, but I still get errors due to a shape mismatch between the tensors. The idea is just to yield two NumPy arrays with SIZE = 10 and run a linear regression on them. Here is the code:

import numpy as np
import tensorflow as tf

SIZE = 10


def _generator():
    feats = np.random.normal(0, 1, SIZE)
    labels = np.random.normal(0, 1, SIZE)
    yield feats, labels


def input_func_gen():
    shapes = (SIZE, SIZE)
    dataset = tf.data.Dataset.from_generator(generator=_generator,
                                             output_types=(tf.float32, tf.float32),
                                             output_shapes=shapes)
    dataset = dataset.batch(10)
    dataset = dataset.repeat(20)
    iterator = dataset.make_one_shot_iterator()
    features_tensors, labels = iterator.get_next()
    features = {'x': features_tensors}
    return features, labels


def train():
    x_col = tf.feature_column.numeric_column(key='x', )
    es = tf.estimator.LinearRegressor(feature_columns=[x_col])
    es = es.train(input_fn=input_func_gen)

Another question: is it possible to use this functionality to provide data for feature columns that are tf.feature_column.crossed_column? The overall goal is to use Dataset.from_generator for batch training where data is loaded in chunks from a database, in cases where the data does not fit in memory. All opinions and examples are highly appreciated.

Thanks!

tborges
Y. Boshev

1 Answer

The optional output_shapes argument of tf.data.Dataset.from_generator() allows you to specify the shapes of the values yielded from your generator. There are two constraints on its type that define how it should be specified:

  • The output_shapes argument is a "nested structure" (e.g. a tuple, a tuple of tuples, a dict of tuples, etc.) that must match the structure of the value(s) yielded by your generator.

    In your program, _generator() contains the statement yield feats, labels. Therefore the "nested structure" is a tuple of two elements (one for each array).

  • Each component of the output_shapes structure should match the shape of the corresponding tensor. The shape of an array is always a tuple of dimensions. (The shape of a tf.Tensor is more general: see this Stack Overflow question for a discussion.) Let's look at the actual shape of feats:

    >>> SIZE = 10
    >>> feats = np.random.normal(0, 1, SIZE)
    >>> print(feats.shape)
    (10,)
    

Therefore the output_shapes argument should be a 2-element tuple, where each element is (SIZE,):

shapes = ((SIZE,), (SIZE,))
dataset = tf.data.Dataset.from_generator(generator=_generator,
                                         output_types=(tf.float32, tf.float32),
                                         output_shapes=shapes)
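
If it is unclear what structure to pass, one way to check is to pull a single element from the generator and read the shapes off it. The sketch below uses only NumPy; infer_output_shapes is a hypothetical helper written for this illustration (it is not part of TensorFlow), and it assumes the generator yields a flat tuple of arrays, as _generator() does here.

```python
import numpy as np

SIZE = 10


def _generator():
    feats = np.random.normal(0, 1, SIZE)
    labels = np.random.normal(0, 1, SIZE)
    yield feats, labels


def infer_output_shapes(generator_fn):
    """Hypothetical helper: return one shape tuple per yielded component.

    Assumes generator_fn() yields a flat tuple of array-like values.
    """
    first_element = next(generator_fn())
    return tuple(np.asarray(component).shape for component in first_element)


shapes = infer_output_shapes(_generator)
print(shapes)  # ((10,), (10,))
```

The result has the same nesting as the yielded value (a 2-element tuple), and each entry is itself a tuple of dimensions, which is the form output_shapes expects in this case.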

Finally, you will need to provide a little more information about shapes to the tf.feature_column.numeric_column() and tf.estimator.LinearRegressor() APIs:

x_col = tf.feature_column.numeric_column(key='x', shape=(SIZE,))
es = tf.estimator.LinearRegressor(feature_columns=[x_col],
                                  label_dimension=SIZE)
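
For the stated goal of loading data in chunks when it does not fit in memory, a generator can yield many examples, one per iteration, rather than a single pair. The following is only a sketch under assumptions: load_chunk, CHUNK_ROWS, and chunked_generator are hypothetical names, and load_chunk returns random data as a stand-in for a real database query. Each yielded pair has shapes ((SIZE,), (SIZE,)), so the output_shapes value shown above would still apply.

```python
import numpy as np

SIZE = 10
CHUNK_ROWS = 4  # hypothetical number of rows fetched per chunk


def load_chunk(chunk_index):
    """Hypothetical stand-in for a database query that returns one chunk."""
    rng = np.random.RandomState(chunk_index)
    feats = rng.normal(0, 1, (CHUNK_ROWS, SIZE))
    labels = rng.normal(0, 1, (CHUNK_ROWS, SIZE))
    return feats, labels


def chunked_generator(num_chunks=3):
    """Load one chunk at a time and yield its rows as individual examples."""
    for chunk_index in range(num_chunks):
        feats, labels = load_chunk(chunk_index)
        for row_feats, row_labels in zip(feats, labels):
            # One example per yield; only the current chunk is held in memory.
            yield row_feats, row_labels


examples = list(chunked_generator())
print(len(examples))         # 12
print(examples[0][0].shape)  # (10,)
```

Because chunked_generator can be called with no arguments, it could be passed as the generator argument in place of _generator, and dataset.batch() would then group the individual examples into batches.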
mrry
  • Great! But how do you do it if feats has a shape like this: feats = np.random.rand(4,2) and labels = np.random.rand(4,1)? Can I feed data with these dimensions to the estimator, and how should it be configured? Thanks @mrry – Julio CamPlaz Apr 05 '18 at 13:17
  • If I try it, I get this error: ValueError: Dimensions must be equal, but are 2 and 3 for 'linear/head/labels/assert_equal/Equal' (op: 'Equal') with input shapes: [2], [3]. – Julio CamPlaz Apr 05 '18 at 13:17
  • @JulioCamPlaz I ran into the same problem. Have you solved it for feats of that size? Thanks – crafet Aug 20 '18 at 06:32
  • @crafet Yes, you cannot do it this way. You can only adjust the batch size to make it go faster – Julio CamPlaz Aug 20 '18 at 07:10
  • @JulioCamPlaz If the generator yields only one item, how do I make it faster? Thanks – crafet Aug 24 '18 at 13:07
  • @crafet I changed the batch size and it goes faster – Julio CamPlaz Aug 24 '18 at 14:31
  • @JulioCamPlaz Thanks for the reply. Following your suggestion, it works now. Still one question: there is no way to yield more than one sample using from_generator, right? At least for me, I tried many ways but failed. – crafet Aug 25 '18 at 14:38