
I have three different .csv datasets that I typically read using pandas and then use to train deep learning models. Each dataset is an n-by-m matrix, where n is the number of samples and m is the number of features. After reading the data, I do some reshaping and then feed it to my deep learning model using feed_dict:

import numpy as np
import pandas as pd
import tensorflow as tf

# Three toy DataFrames standing in for the real CSV data
data1 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data2 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data3 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])

# Combine the three datasets along the feature axis
data = pd.concat([data1, data2, data3], axis=1)

# Some deep learning model that works with data
# An optimizer

with tf.compat.v1.Session() as sess:
    sess.run(init)
    sess.run(optimizer, feed_dict={SOME VARIABLE: data})

However, my data is now too big to fit in memory, and I am wondering how I can use tf.data to read the data instead of using pandas. Sorry, the script I've provided is pseudo-code and not my actual code.

khemedi

1 Answer


Applicable to TF 2.0 and above. There are a few ways to create a Dataset from CSV files:

  1. I believe you are reading CSV files with pandas and then doing this

    tf.data.Dataset.from_tensor_slices(dict(pandaDF))

  2. You can also try this out (a minimal sketch follows the link below)

    tf.data.experimental.make_csv_dataset

  3. Or this

    tf.io.decode_csv

  4. Also this

    tf.data.experimental.CsvDataset

Details are here: Load CSV
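
For example, a minimal sketch of option 2 in TF 2.x (the file name data1.csv and the parameter values are assumptions based on your setup, not tested against your data):

import tensorflow as tf

# "data1.csv" is a hypothetical path; the columns follow the question's frames
dataset = tf.data.experimental.make_csv_dataset(
    "data1.csv",
    batch_size=32,       # the dataset yields ready-made batches
    label_name=None,     # no label column, so each element is a dict of feature columns
    num_epochs=1,
    shuffle=False,
    header=True)

# Each element maps column name -> tensor of shape (batch_size,)
first_batch = next(iter(dataset))
print({name: values.shape for name, values in first_batch.items()})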

If you need to do processing prior to loading with Pandas, you can follow your current approach, but instead of doing pd.concat([data1, data2, data3], axis=1), use the concatenate function:

data1 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data2 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])
data3 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10, 3)), columns=['A', 'B', 'C'])

# Build one Dataset per DataFrame, then chain them with Dataset.concatenate
tf_dataset = tf.data.Dataset.from_tensor_slices(dict(data1))
tf_dataset = tf_dataset.concatenate(tf.data.Dataset.from_tensor_slices(dict(data2)))
tf_dataset = tf_dataset.concatenate(tf.data.Dataset.from_tensor_slices(dict(data3)))

More about concatenate
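
If you go this route, here is a minimal sketch of pulling mini-batches from the tf_dataset built above in TF 2.x (the batch size of 4 is arbitrary):

batched = tf_dataset.batch(4)

# Each element of batched is a dict: column name -> tensor with up to 4 values
first_batch = next(iter(batched))
print({name: values.numpy() for name, values in first_batch.items()})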

Nikhil
  • Thanks for your answer. I am trying to use ```tf.data.experimental.make_csv_dataset``` and I can load the data from the CSV files. Do you know how to iterate over the data and get batches without using a for loop? I don't want to use a for loop because I have three separate datasets that I would like to extract batches from at the same time (inside the same iteration) – khemedi Aug 25 '21 at 17:30
  • For example, I first read the data using : ```data1_tf = tf.data.experimental.make_csv_dataset(data1_filepath, batch_size=32, label_name=None, num_epochs=10, shuffle=0, header=True)```. Then when I try to create an iterator using ```iterator = data1_tf.make_one_shot_iterator() ```, it produces this error:```*** AttributeError: 'PrefetchDataset' object has no attribute 'make_one_shot_iterator'``` – khemedi Aug 25 '21 at 17:38
  • I think you are using TF Version 1 style code. You should look at the documentation for tf.data in TF Version 2, which uses [as_numpy_iterator()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#as_numpy_iterator) (see the sketch after these comments) – Nikhil Aug 25 '21 at 19:50
  • Also, if you are worried about spending too much time on preprocessing, you can save the `tf.data.Dataset` to file with [`tf.data.experimental.save`](https://www.tensorflow.org/api_docs/python/tf/data/experimental/save) and then load it with `tf.data.experimental.load`. – Nikhil Aug 25 '21 at 19:53
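
To make the TF 2.x iteration from the comments concrete, here is a minimal sketch with hypothetical file paths. It uses tf.data.Dataset.zip (not mentioned above, but a standard way to step through several datasets in lockstep), so a single next() call yields a matching batch from each CSV file with no explicit for loop:

import tensorflow as tf

# Hypothetical file paths; the make_csv_dataset arguments mirror the comment above
def load(path):
    return tf.data.experimental.make_csv_dataset(
        path, batch_size=32, label_name=None,
        num_epochs=10, shuffle=False, header=True)

data1_tf = load("data1.csv")
data2_tf = load("data2.csv")
data3_tf = load("data3.csv")

# Zip the three datasets so they are advanced together
zipped = tf.data.Dataset.zip((data1_tf, data2_tf, data3_tf))

it = iter(zipped)                    # TF 2.x replacement for make_one_shot_iterator()
batch1, batch2, batch3 = next(it)    # one batch from each file per call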