
I've tried looking for this and haven't had any meaningful results.

I have a multi-input model, and my data was getting too large for my pandas approach, so I preprocessed it and saved it as a Parquet file. I'm not sure how to open it with Keras.

I looked up tf.datasets but I still cannot figure out how to read a parquet file that I can pass to my model.

Does anyone know how to open Parquet files? I can't figure out how to do this in TensorFlow and can't find anything related to it in Keras.

robertspierre
Lostsoul

2 Answers


You can probably keep your pandas approach, but you would have to break your data down into chunks.

If you have already broken it down to create your parquet file, you should be able to use the same method to have only a subset of your data opened in pandas at a time.

If you need to extract the data from your Parquet file, here's a link on how to create chunks of data for a pandas DataFrame: "How to read a CSV file subset by subset with Pandas?"

Once you have a chunk of data, you can call model.fit on that chunk, then move on to the next chunk and call model.fit again.

user14518362

You can look into TensorFlow I/O, a collection of file systems and file formats that are not available in TensorFlow's built-in support. It provides functions such as tfio.IODataset.from_parquet and tfio.IOTensor.from_parquet for working with the Parquet file format.

!pip install tensorflow_io -U -q
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_io as tfio

# Build a small example frame and write it to a Parquet file
df = pd.DataFrame({"data": tf.random.normal([20], 0, 1, tf.float32).numpy(),
                   "label": np.random.randint(2, size=(20))})
df.to_parquet("df.parquet")
pd.read_parquet("df.parquet")[:2]
    data    label
0   0.721347    1
1   -1.215225   1

# Open the same file as a dataset
ds = tfio.IODataset.from_parquet("df.parquet")
ds

FYI, I think you should also consider using the Feather format rather than Parquet. AFAIK, Parquet files can be heavy to load and can slow down your training pipeline, whereas Feather is comparatively fast to read.

Innat