I came across the link below for reading images into Apache Spark using Python:
Spark using PySpark read images
Below is a code snippet where I call a describe() function. The describe() function computes a histogram for each image, converts it into a feature vector, and stores the vectors in a .csv file, which I will later use to compare against a single query image and get similarity scores.
for imagePath in glob.glob(args["dataset"] + "/*.*"):
    # extract the image ID (i.e. the unique filename) from the image
    # path and load the image itself
    imageID = imagePath[imagePath.rfind("/") + 1:]
    image = cv2.imread(imagePath)

    # describe the image
    features = cd.describe(image)

    # write the features to file
    features = [str(f) for f in features]
    output.write("%s,%s\n" % (imageID, ",".join(features)))
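For reference, cd here is a color descriptor object, roughly along the lines of the sketch below (simplified; the HSV color space and bin counts are placeholders for what I actually use):

import cv2

class ColorDescriptor:
    def __init__(self, bins=(8, 8, 8)):
        # number of histogram bins per HSV channel
        self.bins = bins

    def describe(self, image):
        # compute a 3D HSV color histogram and flatten it
        # into a 1D feature vector
        hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, list(self.bins),
                            [0, 180, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        return hist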
My only concern: after reading the images with sc.binaryFiles("/path/images"), how do I call the cd.describe() function above so that it computes the features in parallel?
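Something like the sketch below is what I have in mind, but I am not sure it is correct (decoding the raw bytes with cv2.imdecode and shipping cd to the executors are my guesses):

import cv2
import numpy as np

def index_image(record):
    # sc.binaryFiles yields (path, raw_bytes) pairs
    path, data = record
    imageID = path[path.rfind("/") + 1:]
    # decode the raw bytes into an OpenCV image (numpy array)
    image = cv2.imdecode(np.frombuffer(data, dtype=np.uint8),
                         cv2.IMREAD_COLOR)
    # cd would have to be picklable to reach the workers
    features = cd.describe(image)
    return "%s,%s" % (imageID, ",".join(str(f) for f in features))

rdd = sc.binaryFiles("/path/images")
rdd.map(index_image).saveAsTextFile("/path/index")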
Sorry if I sound naive here, but I am working with Python and OpenCV for the first time.
EDIT: I am quite new to these serialization concepts, so let me try to simplify the question.
I am running the Spark code below:
# import libraries
import cv2
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
import time

# read the image
image = cv2.imread(master_folder + "/query_shelf.png")

# create an RDD from the image
image2 = sc.parallelize(image)

# calculate the histogram
hist = cv2.calcHist([image2], [0], None, [256], [0, 256])
The last line fails with this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: images is not a numpy array, neither a scalar
The calcHist() function works fine on image, which is already a numpy array, but fails on image2, which is not: I guess sc.parallelize(image) splits the 2D pixel array into an RDD of rows, so calcHist() no longer receives a numpy array. It seems the only way to achieve parallelism in Spark (if we can at all here) is to run mappers over RDD elements rather than over local Python arrays. Please suggest.
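For instance, is the sketch below the right direction? Each RDD element is a whole image path, so the mapper gets a complete numpy array to work on (untested, and it assumes the image files are reachable from every executor):

paths = glob.glob(master_folder + "/*.png")
paths_rdd = sc.parallelize(paths)

def hist_of(path):
    # each worker loads its own image and computes the histogram locally
    img = cv2.imread(path)
    return cv2.calcHist([img], [0], None, [256], [0, 256])

hists = paths_rdd.map(hist_of).collect()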