I'm trying to write an image classification algorithm using Python and Spark.
I'm following this tutorial, which is taken from the official Databricks documentation and works perfectly when running locally.
My problem is that, now that I'm moving the algorithm to a cluster, I have to load my images in .jpg format from two folders on HDFS, and I can't find a way to create a DataFrame the way it's done locally in the examples.
I'm looking for a substitute for this code:
from pyspark.sql.functions import lit
from sparkdl import readImages

jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))