OpenCV read images from pyspark and pass to a Keras model

Question

This is a follow-up question to the answer posted here. I'm using PySpark 2.4.4. I have a bunch of images (some .png, some .jpeg) stored on Google Cloud Storage (GCS) that I need to pass to a TensorFlow model. I'm loading my images like this:

import cv2
import numpy as np

images = spark.read.format("image").option("dropInvalid", False).load("gs://my-bucket/my_image.jpg")
images = images.collect()
image = cv2.imdecode(np.frombuffer(images[0].image.data, np.uint8), cv2.IMREAD_COLOR)

Based on the OpenCV documentation I've read, it seems like OpenCV isn't able to understand my data format. I know this because cv2.imdecode(...) returns None. The official Spark documentation explicitly mentions compatibility with OpenCV, so I know it's possible.
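
A likely explanation, offered as something to verify rather than a confirmed diagnosis: Spark's image data source decodes each file on load, so `image.data` holds raw pixel bytes (row-wise, usually in BGR order) rather than an encoded JPEG or PNG stream, which leaves `cv2.imdecode` with nothing to decode. The sketch below reshapes such a buffer into an OpenCV-style array. `SparkImage` is a hypothetical stand-in for the real row; its field names mirror Spark's image schema.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for the struct in the `image` column; the field
# names follow Spark's image schema (height, width, nChannels, data).
SparkImage = namedtuple("SparkImage", ["height", "width", "nChannels", "data"])


def spark_image_to_array(img):
    """Reinterpret already-decoded pixel bytes as a (height, width, channels) array."""
    flat = np.frombuffer(img.data, dtype=np.uint8)
    return flat.reshape(img.height, img.width, img.nChannels)


# A fake 2x3 image with 3 channels, filled with the byte values 0..17.
fake = SparkImage(height=2, width=3, nChannels=3, data=bytes(range(18)))
pixels = spark_image_to_array(fake)
```

With a real row this would be called as `spark_image_to_array(images[0].image)`. The model in this question appears to consume encoded bytes, so a decoded array may still not be what `model.predict` expects.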

Eventually I want to be able to do this.

prediction = model.predict(np.array([image]))[0]

Outside of Spark, if I get my image from an HTTP endpoint instead of GCS, all I have to do is this, which works.

import urllib.request

resp = urllib.request.urlopen(image_url)
image = resp.read()
prediction = model.predict(np.array([image]))[0]

To get a better sense of what the model is looking for, this is what the data should look like before it's passed into the np.array([...]) part.

print(resp.read())
>>> b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\ ...'

I can confirm that the images aren't corrupted when they're on GCS. When I download the same image from GCS to my laptop and read it like this, I get a similar-looking format. The model is also able to consume the image this way. I've also visually inspected the downloaded GCS image, and it looks fine.

with open("./my_image.jpeg", "rb") as image:
    print(image.read())
>>> b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\ ...'
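
The leading bytes in both dumps, `\xff\xd8\xff`, are the JPEG start-of-image signature, which suggests the model consumes the encoded file contents rather than decoded pixels. Below is a small sketch for checking which kind of buffer is in hand. The helper name `sniff_image_format` is made up for illustration, and only the JPEG and PNG signatures are covered.

```python
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_format(buf):
    """Return 'jpeg' or 'png' if buf starts with that signature, else None."""
    head = bytes(buf[:8])
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    if head.startswith(PNG_MAGIC):
        return "png"
    return None


# Encoded bytes (as in the dumps above) versus arbitrary raw pixel bytes.
encoded = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"
raw_pixels = bytes([12, 34, 56] * 4)
print(sniff_image_format(encoded), sniff_image_format(raw_pixels))
```

If a buffer coming out of Spark yields `None` here, it is probably not an encoded image file, and `cv2.imdecode` would be expected to fail on it.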

score 0 · Answer 1 · answered Apr 24 '20 at 12:02

Not sure if this is what you are looking for, but I was able to achieve this by reading the raw file bytes, opening them as PIL images, and converting those to cv2 images.

Spark loading:

images = sc.binaryFiles('/tmp/images/*', 10)
df = images.map(lambda img: extract_line_coords(img)).toDF()
df.show(5, False)

Function

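import io
import cv2
import numpy
from PIL import Image

def extract_line_coords(binary_images):
    # binaryFiles yields (path, raw file bytes) pairs
    name, img = binary_images
    # Decode the encoded bytes with PIL, forcing three RGB channels
    pil_image = Image.open(io.BytesIO(img)).convert('RGB')
    cv2_image = numpy.array(pil_image)
    # Reverse the channel axis: RGB (PIL) -> BGR (OpenCV)
    cv2_image = cv2_image[:, :, ::-1].copy()
    gray = cv2.cvtColor(cv2_image, cv2.COLOR_BGR2GRAY)
    ...
    ...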
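
The `[:, :, ::-1]` slice in the function reverses the last (channel) axis, which is what turns PIL's RGB order into the BGR order OpenCV assumes. A tiny illustration on a made-up one-pixel image:

```python
import numpy as np

# One pure-red pixel in RGB order, shape (height=1, width=1, channels=3).
rgb = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reverse the channel axis; .copy() makes the result contiguous in memory,
# since the reversed slice is only a negatively strided view.
bgr = rgb[:, :, ::-1].copy()

print(bgr[0, 0].tolist())
```

Red ends up in the last channel position, which is where OpenCV functions such as `cv2.cvtColor` with `cv2.COLOR_BGR2GRAY` expect it.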

Reference: Convert image from PIL to openCV format
