
I came across the link below for reading images into Apache Spark using Python.

Spark using PySpark read images

Below is a code snippet where I call a describe() function. The describe() function computes the histograms of these images, converts them into feature vectors, and stores them in a .csv file, which I will later use to compare against a single image to get similarity scores.

for imagePath in glob.glob(args["dataset"] + "/*.*"):
    # extract the image ID (i.e. the unique filename) from the image
    # path and load the image itself
    imageID = imagePath[imagePath.rfind("/") + 1:]
    image = cv2.imread(imagePath)

    # describe the image
    features = cd.describe(image)

    # write the features to file
    features = [str(f) for f in features]
    output.write("%s,%s\n" % (imageID, ",".join(features)))
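For reference, cd here is the color descriptor object from the tutorial; its implementation is not shown in this question. A hypothetical sketch of the kind of histogram-based describe() I mean (the real class may well differ):

import cv2

def describe(image, bins=(8, 8, 8)):
    # hypothetical sketch: a 3D HSV color histogram, normalized and
    # flattened into a 1D feature vector; the real cd.describe may differ
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()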

My only concern is this: after reading the images using sc.binaryFiles("/path/images"), how do I call the above cd.describe() function in such a way that it computes the indexes in parallel?
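For concreteness, the pattern I have in mind looks roughly like this. It is an untested sketch: decode_image is a helper I would have to write, and it assumes that cd and the feature vectors it returns can be pickled and shipped to the executors.

import numpy as np
import cv2

def decode_image(path_and_bytes):
    # turn each (path, raw bytes) record from binaryFiles into
    # (imageID, image as a numpy array)
    path, data = path_and_bytes
    image_id = path[path.rfind("/") + 1:]
    array = np.frombuffer(data, dtype=np.uint8)
    return image_id, cv2.imdecode(array, cv2.IMREAD_COLOR)

features = (sc.binaryFiles(args["dataset"])
              .map(decode_image)
              .mapValues(cd.describe)
              .map(lambda kv: "%s,%s" % (kv[0], ",".join(str(f) for f in kv[1]))))
features.saveAsTextFile("/path/to/features_csv")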

Sorry if I sound naive here, but I am working with Python and OpenCV for the first time.

EDIT: I am quite new to these serialization concepts, so let me try to simplify the question.

I am running the Spark code below.

Import the libraries:

import cv2
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
import time

Read the image:

image = cv2.imread(master_folder + "/query_shelf.png")

Create an RDD from the image:

image2 = sc.parallelize(image)

Calculate the histogram:

hist = cv2.calcHist([image2], [0], None, [256], [0, 256])

This gives the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: images is not a numpy array, neither a scalar

The above calcHist() function works fine on image, since that is already a NumPy array, but it fails on image2, which is an RDD rather than an array. I guess the only way to achieve parallelism in Spark (if we can) is to run the mappers on RDDs rather than on local Python arrays. A sketch of what I mean is below. Please suggest.
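To make that concrete, here is the per-image (rather than per-pixel) pattern I am imagining. This is an untested sketch; it assumes master_folder is readable from every worker node (e.g. shared or HDFS-backed storage) and that OpenCV is installed on the workers:

import glob
import cv2

# one task per image: each mapper loads its own image locally and
# computes the histogram on an ordinary numpy array
paths = glob.glob(master_folder + "/*.png")
histograms = (sc.parallelize(paths)
                .map(lambda p: (p, cv2.imread(p)))
                .mapValues(lambda img: cv2.calcHist([img], [0], None,
                                                    [256], [0, 256])))
result = histograms.collect()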

  • I actually asked this question in the thread you suggested, @zero323, but I was asked to open a new question. I have edited this question as per your suggestion. Thanks – Mjas Aug 03 '16 at 10:15
  • OK, I am not sure if I understand. Do you ask if you can `sc.binaryFiles(...).map(readFile).map(cd.describe)`? If the result is serializable, or doesn't move around, it should be OK. – zero323 Aug 03 '16 at 10:50
  • Image processing, to my mind, is one of the best use cases for parallelism, since each image is independent of the others. The function cd.describe(image) computes features of the images in a given directory. If I can spread the images across the nodes of an HDFS storage system, can I write Python code that spawns parallel mappers on each node, calls the cd.describe(image) function in parallel, and computes the feature vectors? Later we can have a reducer which combines the results and stores them in a single file (see the sketch after these comments). – Mjas Aug 04 '16 at 04:35
  • I can only guess what `cd.describe` is doing, but long story short: if you can rewrite it as an associative operator which uses more or less constant memory and produces serializable results, then using it as a reducing operation is an option. – zero323 Aug 04 '16 at 08:37
  • I have narrowed my problem down to a piece of code and edited the question above. Please see. Thanks – Mjas Aug 05 '16 at 10:33
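Following up on the reducer idea in the comments, a minimal sketch of the "combine into a single file" step, assuming a features RDD of CSV lines like the one sketched earlier (coalesce(1) forces a single output file, which is only sensible for small result sets):

# collapse to one partition so Spark writes a single part file
features.coalesce(1).saveAsTextFile("/path/to/index_csv")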

0 Answers