
I created a very large Spark DataFrame with PySpark on my cluster, and it is too big to fit into memory. I also have an autoencoder model built with Keras, which takes a Pandas DataFrame (an in-memory object) as input.

What is the best way to bring those two worlds together?

I found some libraries that provide deep learning on Spark, but they seem to be intended only for hyperparameter tuning, or, like Apache SystemML, they don't support autoencoders.

I am surely not the first one to train a neural network on Spark DataFrames. I have a conceptual gap here; please help!

BobBetter

1 Answer


As you mentioned, a Pandas DataFrame is an in-memory object, so once you convert to Pandas the training is no longer distributed. For distributed training you have to keep the data in a Spark DataFrame and rely on specific third-party packages that handle the distribution, as in the sketch below.
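For example, Elephas is one such package. This is only a minimal sketch, not your exact setup: the column name `features`, the input dimension, and the training parameters are assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from elephas.spark_model import SparkModel

# Small autoencoder; input_dim is assumed to match your feature vector length.
input_dim = 100
model = Sequential([
    Dense(32, activation="relu", input_shape=(input_dim,)),  # encoder
    Dense(input_dim, activation="sigmoid"),                  # decoder
])
model.compile(optimizer="adam", loss="mse")

# Turn the Spark DataFrame into an RDD of (x, y) pairs without collecting it
# to the driver; for an autoencoder the target equals the input.
# Assumes 'features' is an array column; call .toArray() first if it is a Spark ML Vector.
rdd = spark_df.rdd.map(lambda row: (np.array(row["features"]),
                                    np.array(row["features"])))

# Wrap the Keras model so Elephas can train it on the workers.
spark_model = SparkModel(model, frequency="epoch", mode="asynchronous")
spark_model.fit(rdd, epochs=10, batch_size=64, verbose=1, validation_split=0.1)
```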

You can find an overview of the supported approaches here: https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/index.html
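If instead you want to keep a single (non-distributed) Keras model but stream the data out of Spark without ever building a full Pandas DataFrame, Petastorm's Spark converter is one option. A minimal sketch, assuming an `array<float>` column named `features`, an already compiled `model`, and an illustrative cache path:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Cache location for the intermediate Parquet files Petastorm streams from.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

# The DataFrame is reduced to the single array<float> column we train on.
converter = make_spark_converter(df.select("features"))

with converter.make_tf_dataset(batch_size=256) as dataset:
    # Each batch is a namedtuple of columns; for an autoencoder the
    # input is also the target.
    dataset = dataset.map(lambda batch: (batch.features, batch.features))
    # The dataset cycles indefinitely by default, so steps_per_epoch is required.
    model.fit(dataset, steps_per_epoch=1000, epochs=5)
```

Either way, the key point is the same: the data never passes through a Pandas DataFrame on the driver; it stays in Spark (or in a Parquet cache) and is streamed to Keras in batches.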

Yoan B. M.Sc