This is a follow-up question to the answer posted here. I'm using PySpark 2.4.4. I have a bunch of images (some .png, some .jpeg) stored on Google Cloud Storage (GCS) that I need to pass to a TensorFlow model. I'm loading my images like this:
import cv2
import numpy as np
# `spark` is an existing SparkSession
images = spark.read.format("image").option("dropInvalid", False).load("gs://my-bucket/my_image.jpg")
images = images.collect()
image = cv2.imdecode(np.frombuffer(images[0].image.data, np.uint8), cv2.IMREAD_COLOR)  # returns None
Based on the OpenCV documentation I've read, it seems like OpenCV isn't able to understand my data format. I know this because cv2.imdecode(...) returns None. The official Spark documentation explicitly mentions compatibility with OpenCV, so I know it's possible.
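One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the Spark image data source schema, the `image` struct carries `height`, `width`, `nChannels`, `mode`, and `data`, where `data` holds already-decoded pixel bytes in OpenCV-compatible order (row-wise BGR in most cases), not an encoded JPEG/PNG file. If that is the case here, `cv2.imdecode` has nothing to decode, and a plain reshape may be what is needed instead. The helper name `spark_image_to_array` is hypothetical, and the code assumes 8-bit channels.

```python
import numpy as np


def spark_image_to_array(height, width, n_channels, data):
    """Reshape decoded pixel bytes into a (height, width, channels) array.

    Assumption: `data` is raw 8-bit pixel data (one byte per channel),
    not a compressed JPEG/PNG byte stream.
    """
    expected = height * width * n_channels
    if len(data) != expected:
        # A mismatch suggests the buffer is not raw 8-bit pixel data.
        raise ValueError(
            "expected %d bytes for a %dx%dx%d image, got %d"
            % (expected, height, width, n_channels, len(data))
        )
    # frombuffer creates a read-only view over the bytes; reshape is row-major.
    return np.frombuffer(data, dtype=np.uint8).reshape(height, width, n_channels)
```

With the collected rows from above, this would be called as `spark_image_to_array(row.height, row.width, row.nChannels, row.data)` where `row = images[0].image`. Note that the result is a pixel array, which is not the same thing as the encoded bytes the model receives in the HTTP example.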
Eventually I want to be able to do this.
prediction = model.predict(np.array([image]))[0]
Outside of Spark, if I fetch the image from an HTTP endpoint instead of GCS, all I have to do is this, which works:
import urllib.request
resp = urllib.request.urlopen(image_url)
image = resp.read()  # raw, still-encoded file bytes
prediction = model.predict(np.array([image]))[0]
To get a better sense of what the model is looking for, this is what the data should look like before it's passed into the np.array([...]) part:
print(resp.read())
>>> b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\ ...'
I can confirm that the images aren't corrupted on GCS. When I download the same image from GCS to my laptop and read it as shown below, I get similar-looking bytes, and the model is able to consume the image this way. I've also visually inspected the downloaded GCS image, and it looks fine.
with open("./my_image.jpeg", "rb") as image:
print(image.read())
>>> b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\ ...'
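Both byte strings above begin with `ff d8 ff`, the JPEG start-of-image marker, which suggests the model expects an encoded file. A minimal sketch of a check that could be run on `images[0].image.data` to see whether Spark returned encoded bytes or something else; the function name `sniff_image_format` is hypothetical, and only the JPEG and PNG signatures are covered.

```python
JPEG_MAGIC = b"\xff\xd8\xff"        # JPEG start-of-image marker
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # 8-byte PNG file signature


def sniff_image_format(buf):
    """Return 'jpeg' or 'png' if buf starts with that signature, else None."""
    head = bytes(buf[:8])
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    if head.startswith(PNG_MAGIC):
        return "png"
    # Neither signature matched: possibly raw pixels or another format.
    return None
```

If this returns `None` for the Spark-loaded data but `'jpeg'` for the bytes read over HTTP or from disk, the two code paths are handing the model different kinds of data.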