I am trying to design an input pipeline with the Dataset API. I am working with Parquet files. What is a good way to add them to my pipeline?
2 Answers
We have released Petastorm, an open source library that allows you to use Apache Parquet files directly via the TensorFlow Dataset API.
Here is a small example:
from petastorm.reader import Reader
from petastorm.tf_utils import make_petastorm_dataset
import tensorflow as tf

with Reader('hdfs://.../some/hdfs/path') as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        sample = sess.run(tensor)
        print(sample.id)

Yevgeni Litvin
Maybe a little late, but this is now available directly in TensorFlow through the TensorFlow I/O package, as tfio.experimental.IODataset.from_parquet:
https://www.tensorflow.org/io/api_docs/python/tfio/experimental/IODataset#from_parquet

mastDrinkNimbuPani