I have successfully read two csv files using make_csv_dataset within an input_fn() and passed the result into a tf.estimator.
I did it by first splitting the main csv into two separate frames, one for training and one for testing, and then saving them as new csv files.
from sklearn.model_selection import train_test_split

# df is the DataFrame loaded from the main csv
train, test = train_test_split(df, test_size=0.2)
train_csv_path = 'data/2020_train.csv.gz'
test_csv_path = 'data/2020_test.csv.gz'
train.to_csv(train_csv_path, compression='gzip')
test.to_csv(test_csv_path, compression='gzip')
import tensorflow as tf

def make_input_fn(csv_path, n_epochs=None):
    def input_fn():
        # Each element is a (features, label) batch of up to 1000 rows
        dataset = tf.data.experimental.make_csv_dataset(
            csv_path,
            batch_size=1000,
            label_name='Shipped On SSD',
            compression_type='GZIP',
            num_epochs=n_epochs)
        return dataset
    return input_fn

train_input_fn = make_input_fn(train_csv_path)
test_input_fn = make_input_fn(test_csv_path, n_epochs=1)
However, I want to use just one file and do the split on the dataset itself.
I can successfully split a dataset (as shown below), but the problem occurs when passing it to a tf.estimator. I can't figure out how to use a dataset defined outside an input_fn(), or how to do the splitting within an input_fn().
dataset = tf.data.experimental.make_csv_dataset(
    full_csv_path,
    batch_size=1000,
    label_name='Shipped On SSD',
    compression_type='GZIP')

split = 4
split_fn = lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds)
# Take 4 of every 5 batches for training and the remaining 1 for testing
dataset_train = dataset.window(split, split + 1).flat_map(split_fn)
dataset_test = dataset.skip(split).window(1, split + 1).flat_map(split_fn)
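
To make the goal concrete, here is a rough sketch of what splitting inside the input_fn() might look like. It is only an illustration under some assumptions, not a confirmed solution: `split_dataset`, `make_split_input_fn` and the `train` flag are hypothetical names, and it swaps the window/flat_map approach above for `enumerate` plus `filter`, which selects the same batch positions. It also assumes `shuffle=False` and `num_epochs=1` when reading, so that both input functions see the batches in the same order and the two parts stay disjoint.

```python
import tensorflow as tf


def split_dataset(dataset, split=4, train=True):
    """Keep `split` of every `split + 1` elements (train) or the remaining one (test)."""
    period = split + 1
    # enumerate() pairs each element with its position: (index, element)
    indexed = dataset.enumerate()
    if train:
        kept = indexed.filter(lambda i, batch: i % period < split)
    else:
        kept = indexed.filter(lambda i, batch: i % period == split)
    # Drop the index again so the element structure is unchanged
    return kept.map(lambda i, batch: batch)


def make_split_input_fn(csv_path, train, n_epochs=None, split=4):
    def input_fn():
        dataset = tf.data.experimental.make_csv_dataset(
            csv_path,
            batch_size=1000,
            label_name='Shipped On SSD',
            compression_type='GZIP',
            num_epochs=1,    # read the file once, repeat after the split
            shuffle=False)   # assumption: fixed order keeps the parts disjoint
        dataset = split_dataset(dataset, split=split, train=train)
        # n_epochs=None repeats indefinitely, as in the two-file version
        return dataset.repeat(n_epochs)
    return input_fn
```

With this, the two input functions would be built as `make_split_input_fn(full_csv_path, train=True)` and `make_split_input_fn(full_csv_path, train=False, n_epochs=1)`. Note that the split happens per batch of 1000 rows, not per row, and that the training part is no longer shuffled unless a `shuffle` call is added after the split.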