
I have successfully read two CSV files using `make_csv_dataset` within an `input_fn()` and passed them to a `tf.estimator`.

I did it by first splitting the main CSV into two separate DataFrames, one for training and one for testing, and then saving them as new CSV files.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size = 0.2)

train_csv_path = 'data/2020_train.csv.gz'
test_csv_path = 'data/2020_test.csv.gz'

train.to_csv(train_csv_path, compression = 'gzip')
test.to_csv(test_csv_path, compression = 'gzip')
def make_input_fn(csv_path, n_epochs = None):
    def input_fn():
        dataset = tf.data.experimental.make_csv_dataset(csv_path,
                                                        batch_size = 1000,
                                                        label_name = 'Shipped On SSD',
                                                        compression_type = 'GZIP',
                                                        num_epochs = n_epochs)
        return dataset
    return input_fn

train_input_fn = make_input_fn(train_csv_path)
test_input_fn = make_input_fn(test_csv_path, n_epochs = 1)

However, I want to use just one file and do the split on the dataset.

I can successfully split a dataset (like this), but the issue occurs when passing it to a `tf.estimator`. I can't figure out how to use a dataset defined outside an `input_fn()`, or how to do the splitting within an `input_fn()`.

dataset = tf.data.experimental.make_csv_dataset(full_csv_path,
                                                batch_size = 1000,
                                                label_name = 'Shipped On SSD',
                                                compression_type = 'GZIP')

split = 4
split_fn = lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds)

dataset_train = dataset.window(split, split + 1).flat_map(split_fn)
dataset_test = dataset.skip(split).window(1, split + 1).flat_map(split_fn)
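To see which elements end up where, here is a minimal pure-Python sketch of the same window/skip pattern (the `interleaved_split` helper is hypothetical, just an illustration, not part of `tf.data`): `window(split, split + 1)` takes the first `split` elements of every stride of `split + 1`, while `skip(split).window(1, split + 1)` takes the last one. Note that here the elements are 1000-row batches, so a 4:1 ratio gives roughly an 80/20 split.

```python
# Pure-Python illustration of the window(split, split + 1) / skip(split) pattern:
# every stride of (split + 1) consecutive elements contributes its first `split`
# elements to the train set and its last element to the test set.
def interleaved_split(elements, split=4):
    train, test = [], []
    for i, elem in enumerate(elements):
        if i % (split + 1) < split:
            train.append(elem)
        else:
            test.append(elem)
    return train, test

train, test = interleaved_split(range(10), split=4)
print(train)  # [0, 1, 2, 3, 5, 6, 7, 8]
print(test)   # [4, 9]
```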

1 Answer

You can wrap your dataset creation in a function. Unfortunately, that function will read the CSV file twice, once for each set.

def make_input_fn_from_csv(csv_path, training=True):
  def input_fn():
    # The dataset must be created inside input_fn so it lives in the
    # same graph as the estimator's iterator.
    dataset = tf.data.experimental.make_csv_dataset(csv_path,
                                                    batch_size = 1000,
                                                    label_name = 'Shipped On SSD',
                                                    compression_type = 'GZIP')

    split = 4
    split_fn = lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds)

    if training:
      return dataset.window(split, split + 1).flat_map(split_fn)
    return dataset.skip(split).window(1, split + 1).flat_map(split_fn)
  return input_fn

train_input_fn = make_input_fn_from_csv(full_csv_path, training=True)
test_input_fn = make_input_fn_from_csv(full_csv_path, training=False)
Lescurel
  • That gives the following error: The graph () of the iterator is different from the graph () the dataset: tf.Tensor(, shape=(), dtype=variant) was created in. If you are using the Estimator API, make sure that no part of the dataset returned by the `input_fn` function is defined outside the `input_fn` function. Please ensure that all datasets in the pipeline are created in the same graph as the iterator. – Mattias Thalén Feb 10 '21 at 12:51
  • 1
    Oh, I forgot about that limitation. Let me edit my answer. – Lescurel Feb 10 '21 at 13:03
  • Thanks! Should I turn off shuffling so I don't run the risk of including training dataset into the test dataset? – Mattias Thalén Feb 10 '21 at 13:53
  • 1
    Yes, shuffling should be avoided. Or you can make it deterministic with a call to `tf.random.set_seed` before the shuffle and pass `reshuffle_each_iteration=False` to the `shuffle` method. – Lescurel Feb 10 '21 at 13:56
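  To make the seeded-shuffle suggestion concrete, here is a plain-Python sketch of the principle (using `random.Random` as a stand-in for `tf.random.set_seed` plus `shuffle(..., reshuffle_each_iteration=False)`; the `seeded_shuffle` helper is hypothetical): a fixed seed produces the same permutation on every pass, so the window-based train/test partition stays stable across epochs.

  ```python
  import random

  def seeded_shuffle(elements, seed=42):
      # A fixed seed yields the same permutation every call, the plain-Python
      # analogue of seeding TF's RNG and disabling reshuffling between epochs.
      rng = random.Random(seed)
      shuffled = list(elements)
      rng.shuffle(shuffled)
      return shuffled

  epoch_1 = seeded_shuffle(range(10))
  epoch_2 = seeded_shuffle(range(10))
  assert epoch_1 == epoch_2  # identical order => stable train/test windows
  ```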