
I'm trying to write an image classification algorithm using Python and Spark.
I'm following this tutorial, which is taken from the official Databricks documentation and works perfectly when run locally.

My problem now, moving the algorithm to a cluster, is that I have to load my images in .jpg format from two folders on HDFS, and I can't find a way to create a DataFrame the way it's done locally in the examples.

I'm looking for a substitute for this code:

from pyspark.sql.functions import lit
from sparkdl import readImages

jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))

1 Answer


It should be pretty much the same as reading the files locally.

Below is the relevant implementation from the library. It internally uses the binaryFiles API to load the binary files, and the API documentation (binaryFiles) notes that it supports Hadoop filesystems too.

rdd = sc.binaryFiles(path, minPartitions=numPartitions).repartition(numPartitions)

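So in principle you can just point readImages at an HDFS URI. A minimal sketch, assuming the namenode address and folder layout below are placeholders for your own cluster:

from pyspark.sql.functions import lit
from sparkdl import readImages

# Placeholder HDFS location -- substitute your own namenode
# address and image directory layout.
img_dir = "hdfs://namenode:8020/user/me/images"

# readImages delegates to sc.binaryFiles under the hood, which
# resolves hdfs:// URIs through the Hadoop filesystem API, so the
# call looks the same as in the local tutorial.
jobs_df = readImages(img_dir + "/jobs").withColumn("label", lit(1))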
Hope this helps.

  • Yes, this goes in the direction I need. I think I can then get a DataFrame out of this RDD with the solution provided in this other Stack Overflow thread: https://stackoverflow.com/questions/39699107/spark-rdd-to-dataframe-python. Thanks a lot. – Andrea Dec 19 '17 at 17:22
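A rough sketch of the RDD-to-DataFrame conversion mentioned in that comment, where the HDFS path and column names are placeholders and a SparkSession is assumed to be available (as in the pyspark shell):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (filePath, fileContent) pairs; wrapping the
# raw bytes in a bytearray lets toDF infer a BinaryType column.
rdd = sc.binaryFiles("hdfs://namenode:8020/user/me/images/jobs")
jobs_df = (rdd.map(lambda kv: (kv[0], bytearray(kv[1])))
              .toDF(["filePath", "content"])
              .withColumn("label", lit(1)))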