
I have three different .csv datasets that I typically read using pandas and then use to train deep learning models. Each dataset is an n-by-m matrix, where n is the number of samples and m is the number of features. After reading the data, I do some reshaping and then feed it to my deep learning model using feed_dict:

import numpy as np
import pandas as pd
import tensorflow as tf

# Three toy DataFrames standing in for the real CSV data
data1 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data2 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data3 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])

# Combine the three datasets along the feature axis
data = pd.concat([data1, data2, data3], axis=1)

# Some deep learning model that works with data
# An optimizer

with tf.compat.v1.Session() as sess:
    sess.run(init)
    sess.run(optimizer, feed_dict={SOME VARIABLE: data})

However, my data is now too big to fit in memory, and I am wondering how I can use tf.data to read the data instead of using pandas. Sorry, the script I've provided is pseudo-code and not my actual code.

khemedi

1 Answer


Applicable to TF 2.0 and above. There are a few ways to create a Dataset from CSV files:

  1. I believe you are reading CSV files with pandas and then doing this

    tf.data.Dataset.from_tensor_slices(dict(pandaDF))

  2. You can also try this out (a minimal sketch follows the link below)

    tf.data.experimental.make_csv_dataset

  3. Or this

    tf.io.decode_csv

  4. Also this

    tf.data.experimental.CsvDataset

Details are here: Load CSV
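
For example, a minimal sketch of option 2 in TF 2.x (the file name data1.csv and the parameter values are assumptions based on your setup, not tested against your data):

import tensorflow as tf

# "data1.csv" is a hypothetical path; the columns follow the question's frames
dataset = tf.data.experimental.make_csv_dataset(
    "data1.csv",
    batch_size=32,       # the dataset yields ready-made batches
    label_name=None,     # no label column, so each element is a dict of feature columns
    num_epochs=1,
    shuffle=False,
    header=True)

# Each element maps column name -> tensor of shape (batch_size,)
first_batch = next(iter(dataset))
print({name: values.shape for name, values in first_batch.items()})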

If you need to do processing prior to loading with Pandas, you can follow your current approach, but instead of doing pd.concat([data1, data2, data3], axis=1), use the concatenate function:

data1 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data2 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data3 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])

# Build one Dataset per DataFrame, then chain them with Dataset.concatenate
tf_dataset = tf.data.Dataset.from_tensor_slices(dict(data1))
tf_dataset = tf_dataset.concatenate(tf.data.Dataset.from_tensor_slices(dict(data2)))
tf_dataset = tf_dataset.concatenate(tf.data.Dataset.from_tensor_slices(dict(data3)))

More about concatenate
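
If you go this route, here is a minimal sketch of pulling mini-batches from the tf_dataset built above in TF 2.x (the batch size of 4 is arbitrary):

batched = tf_dataset.batch(4)

# Each element of batched is a dict: column name -> tensor with up to 4 values
first_batch = next(iter(batched))
print({name: values.numpy() for name, values in first_batch.items()})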

Nikhil
  • Thanks for your answer. I am trying to use ```tf.data.experimental.make_csv_dataset``` and I can load the data from the CSV files. Do you know how to iterate over the data and get batches without using a for loop? I don't want to use a for loop because I have three separate datasets that I would like to extract batches from at the same time (inside the same iteration) – khemedi Aug 25 '21 at 17:30
  • For example, I first read the data using : ```data1_tf = tf.data.experimental.make_csv_dataset(data1_filepath, batch_size=32, label_name=None, num_epochs=10, shuffle=0, header=True)```. Then when I try to create an iterator using ```iterator = data1_tf.make_one_shot_iterator() ```, it produces this error:```*** AttributeError: 'PrefetchDataset' object has no attribute 'make_one_shot_iterator'``` – khemedi Aug 25 '21 at 17:38
  • I think you are using TF Version 1 style code. You should look at the documentation for tf.data in TF Version 2, which uses [as_numpy_iterator()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#as_numpy_iterator) (see the sketch after these comments) – Nikhil Aug 25 '21 at 19:50
  • Also, if you are worried about spending too much time on preprocessing, you can save the `tf.data.Dataset` to file with [`tf.data.experimental.save`](https://www.tensorflow.org/api_docs/python/tf/data/experimental/save) and then load it with `tf.data.experimental.load`. – Nikhil Aug 25 '21 at 19:53
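
To make the TF 2.x iteration from the comments concrete, here is a minimal sketch with hypothetical file paths. It uses tf.data.Dataset.zip (not mentioned above, but a standard way to step through several datasets in lockstep), so a single next() call yields a matching batch from each CSV file with no explicit for loop:

import tensorflow as tf

# Hypothetical file paths; the make_csv_dataset arguments mirror the comment above
def load(path):
    return tf.data.experimental.make_csv_dataset(
        path, batch_size=32, label_name=None,
        num_epochs=10, shuffle=False, header=True)

data1_tf = load("data1.csv")
data2_tf = load("data2.csv")
data3_tf = load("data3.csv")

# Zip the three datasets so they are advanced together
zipped = tf.data.Dataset.zip((data1_tf, data2_tf, data3_tf))

it = iter(zipped)                    # TF 2.x replacement for make_one_shot_iterator()
batch1, batch2, batch3 = next(it)    # one batch from each file per call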