My question: my training data is stored as a single TFRecord file of 333 GB, and one epoch takes 3 hours to train. What is the best way to split my data in order to improve the speed or performance of the input pipeline?

  1. Split the original dataset (which is in a CSV file) into 10 parts and create 10 TFRecord files (see the first sketch after this list).
  2. Shard the existing single TFRecord file into multiple pieces through tf.data.Dataset.shard (see the second sketch after this list). If this option is better, how should I deal with the sharded dataset within Keras? Should I create 10 iterators (one iterator per shard)? I mean, unlike option one I will not have 10 TFRecord files saved on disk; I will have one TFRecord file and can only get one shard at a time.
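
For concreteness, here is a minimal sketch of what I mean by option 1. The file names, shard count, and CSV columns ("x1", "x2", "label") are placeholders for my real data:

```python
import csv
import tensorflow as tf

NUM_SHARDS = 10  # number of output TFRecord files

# One writer per output shard; the naming scheme is just an assumption.
writers = [
    tf.io.TFRecordWriter("train-%02d-of-%02d.tfrecord" % (i, NUM_SHARDS))
    for i in range(NUM_SHARDS)
]

with open("train.csv") as f:
    for i, row in enumerate(csv.DictReader(f)):
        # Placeholder columns; my real CSV has more fields.
        example = tf.train.Example(features=tf.train.Features(feature={
            "inputs": tf.train.Feature(float_list=tf.train.FloatList(
                value=[float(row["x1"]), float(row["x2"])])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(
                value=[int(row["label"])])),
        }))
        # Distribute rows round-robin across the shards.
        writers[i % NUM_SHARDS].write(example.SerializeToString())

for w in writers:
    w.close()
```

On the reading side I would then interleave the 10 files (e.g. tf.data.Dataset.list_files("train-*.tfrecord") followed by interleave(tf.data.TFRecordDataset, ...)) and feed the single resulting dataset to model.fit.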
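
And here is a sketch of what I imagine option 2 would look like, with the shards interleaved back into one dataset so that Keras still sees a single input (the feature spec matches the placeholder columns above):

```python
import tensorflow as tf

NUM_SHARDS = 10
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Hypothetical feature spec matching the writer sketch above.
feature_spec = {
    "inputs": tf.io.FixedLenFeature([2], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return parsed["inputs"], parsed["label"]

# shard(n, i) keeps every n-th record starting at offset i, so the
# 10 shards together partition the single 333 GB file.
dataset = tf.data.Dataset.range(NUM_SHARDS).interleave(
    lambda i: tf.data.TFRecordDataset("train.tfrecord").shard(NUM_SHARDS, i),
    cycle_length=NUM_SHARDS,
    num_parallel_calls=AUTOTUNE,
)

dataset = (dataset.map(parse, num_parallel_calls=AUTOTUNE)
                  .batch(256)
                  .prefetch(AUTOTUNE))

# model.fit(dataset, epochs=...)  # assumes a compiled Keras model
```

Is this the right way to consume the shards, or do I really need 10 separate iterators?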
W. Sam
  • This question seems related: https://stackoverflow.com/questions/54519309/split-tfrecords-file-into-many-tfrecords-files – xdhmoore Jan 22 '21 at 02:13

0 Answers