
I'm trying to write an image classification algorithm using Python and Spark.
I'm following this tutorial, which is taken from the official Databricks documentation and works perfectly when run locally.

My problem now, moving the algorithm to a cluster, is that I have to load my images in .jpg format from two folders on HDFS, and I can't find a way to create a DataFrame the way it's done locally in the examples.

I'm looking for a substitute for this code:

from pyspark.sql.functions import lit
from sparkdl import readImages

jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))

1 Answer


It should be pretty much the same as reading the files locally.

Below is the relevant implementation from the library. It internally uses the binaryFiles API to load the binary files, and the API documentation (binaryFiles) notes that it supports Hadoop filesystems too.

rdd = sc.binaryFiles(path, minPartitions=numPartitions).repartition(numPartitions)

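So in principle you can just point readImages at an HDFS URI. A minimal sketch, assuming the namenode address and folder layout below are placeholders for your own cluster:

from pyspark.sql.functions import lit
from sparkdl import readImages

# Placeholder HDFS location -- substitute your own namenode
# address and image directory layout.
img_dir = "hdfs://namenode:8020/user/me/images"

# readImages delegates to sc.binaryFiles under the hood, which
# resolves hdfs:// URIs through the Hadoop filesystem API, so the
# call looks the same as in the local tutorial.
jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))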
Hope this helps.

  • Yes, this goes in the direction I need. I think I can then get a DataFrame out of this RDD with the solution provided in this other Stack Overflow thread: https://stackoverflow.com/questions/39699107/spark-rdd-to-dataframe-python. Thanks a lot. – Andrea Dec 19 '17 at 17:22
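A rough sketch of the RDD-to-DataFrame conversion mentioned in that comment, where the HDFS path and column names are placeholders and a SparkSession is assumed to be available (as in the pyspark shell):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (filePath, fileContent) pairs; wrapping the
# raw bytes in a bytearray lets toDF infer a BinaryType column.
rdd = sc.binaryFiles("hdfs://namenode:8020/user/me/images/jobs")
jobs_df = (rdd.map(lambda kv: (kv[0], bytearray(kv[1])))
              .toDF(["filePath", "content"])
              .withColumn("label", lit(1)))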