Spark using PySpark read images

Question

Hi there, I have a lot of images (low millions) that I need to classify. I am using Spark and have managed to read all the images into one big RDD in the form (filename1, content1), (filename2, content2), ...

images = sc.wholeTextFiles("hdfs:///user/myuser/images/image/00*")

However, I am confused about what to do with the unicode representation of each image.

Here is an example of one image/file:

(u'hdfs://NameService/user/myuser/images/image/00product.jpg', u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\ufffd\ufffd\x01\x1eExif\x00\x00II*\x00\x08\x00\x00\x00\x08\x00\x12\x01\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00\x1a\x01\x05\x00\x01\x00\x00\x00n\x00\x00\x00\x1b\x01\x05\x00\x01\x00\x00\x00v\x00\x00\x00(\x01\x03\x00\x01\x00\x00\x00\x02\x00\x00\x001\x01\x02\x00\x0b\x00\x00\x00~\x00\x00\x002\x01\x02\x00\x14\x00\x00\x00\ufffd\x00\x00\x00\x13\x02\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00i\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00GIMP 2.8.2\x00\x002013:07:29 10:41:35\x00\x07\x00\x00\ufffd\x07\x00\x04\x00\x00\x000220\ufffd\ufffd\x02\x00\x04\x00\x00\x00407\x00\x00\ufffd\x07\x00\x04\x00\x00\x000100\x01\ufffd\x03\x00\x01\x00\x00\x00\ufffd\ufffd\x00\x00\x02\ufffd\x04\x00\x01\x00\x00\x00\x04\x04\x00\x00\x03\ufffd\x04\x00\x01\x00\x00\x00X\x01\x00\x00\x05\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00\x02\x00\x01\x00\x02\x00\x04\x00\x00\x00R98\x00\x02\x00\x07\x00\x04\x00\x00\x000100\x00\x00\x00\x00\ufffd\ufffd\x04_http://ns.adobe.com/xap/1.0/\x00<?xpacket begin=\'\ufeff\' id=\'W5M0MpCehiHzreSzNTczkc9d\'?>\n<x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n <rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n  <exif:Orientation>Top-left</exif:Orientation>\n  <exif:XResolution>96</exif:XResolution>\n  <exif:YResolution>96</exif:YResolution>\n  <exif:ResolutionUnit>Inch</exif:ResolutionUnit>\n  <exif:Software>ACD Systems Digital Imaging</exif:Software>\n  <exif:DateTime>2013:07:29 10:37:00</exif:DateTime>\n  <exif:YCbCrPositioning>Centered</exif:YCbCrPositioning>\n  <exif:ExifVersion>Exif Version 2.2</exif:ExifVersion>\n  <exif:SubsecTime>407</exif:SubsecTime>\n  <exif:FlashPixVersion>FlashPix Version 1.0</exif:FlashPixVersion>\n  <exif:ColorSpace>Uncalibrated</exif:ColorSpace>\n

Looking closer, some of the characters look like metadata, for example:

...
<x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n 
<rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n  
<exif:Orientation>Top-left</exif:Orientation>\n  
<exif:XResolution>96</exif:XResolution>\n  
<exif:YResolution>96</exif:YResolution>\n  
...

My previous experience is with scipy and functions like 'imread', where the input is usually a filename. Now I am lost as to what this unicode means and how I can transform it into a format I am familiar with.

Can anyone share how I can read this unicode into a scipy image (ndarray)?

Try mapping over the RDD with imread. I think that should work. To elaborate: I'm not familiar with the JPEG format, but each image is a file with a specific format, and functions like imread exist to simplify working with complicated image formats. — Dair, Oct 15 '15 at 01:20
@Dair reading the source code of [imread](https://github.com/scipy/scipy/blob/v0.16.0/scipy/misc/pilutil.py#L102), it is really trying to read the image using PIL.Image given the file name, forcing imread to read unicode doesn't work. — B.Mr.W., Oct 15 '15 at 01:33

score 12 · Accepted Answer · edited Oct 15 '15 at 02:00

Your data looks like the raw bytes of a real image file (a JPEG, judging by the JFIF marker). The problem is that it should be bytes, not unicode, so you would have to convert from unicode back to bytes. That is a can of worms full of encoding traps. You may get lucky with img.encode('iso-8859-1'), but I don't know, and I will not deal with that in my answer.
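To illustrate why the conversion back to bytes is a trap, here is a minimal Python 3 sketch. It assumes the text reader decodes the file as UTF-8 and substitutes U+FFFD for invalid bytes, which is consistent with the \ufffd characters in the question's output but is not confirmed by the thread. The variable names are only for illustration.

```python
# First bytes of a JPEG/JFIF file: SOI marker, APP0 marker, length, "JFIF".
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Assumption: the reader decodes as UTF-8 and replaces invalid bytes.
# Each byte that is not valid UTF-8 becomes the replacement character U+FFFD.
as_text = jpeg_header.decode("utf-8", errors="replace")

# Encoding back to UTF-8 yields the bytes of U+FFFD, not the original bytes.
round_tripped = as_text.encode("utf-8")

# ISO-8859-1 cannot represent U+FFFD at all, so this encode raises.
try:
    as_text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(repr(as_text))
print(round_tripped == jpeg_header, latin1_ok)
```

If that assumption holds, four different bytes have collapsed into the same replacement character, so no choice of encoding can recover them from the text. The data has to be read as bytes in the first place.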

The raw data for a PNG image looks like this:

rawdata = '\x89PNG\r\n\x1a\n\x00\x00...\x00\x00IEND\xaeB`\x82'

Once you have bytes, you can create a PIL image from the raw data and read it as a NumPy array:

>>> from StringIO import StringIO
>>> from PIL import Image
>>> import numpy as np
>>> np.asarray(Image.open(StringIO(rawdata)))

array([[[255, 255, 255,   0],
    [255, 255, 255,   0],
    [255, 255, 255,   0],
    ...,
    [255, 255, 255,   0],
    [255, 255, 255,   0],
    [255, 255, 255,   0]]], dtype=uint8)

All you need to make it work on Spark is SparkContext.binaryFiles, which reads each file as raw bytes instead of text:

>>> images = sc.binaryFiles("path/to/images/")
>>> image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata)))
>>> images.values().map(image_to_array)
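On Python 3 the StringIO module no longer exists, and io.BytesIO plays the same role for byte data. The following is a minimal sketch of the same idea under that assumption. The helper name decode_image is hypothetical, and a tiny PNG is generated in memory to stand in for one file's content, so the snippet runs without a cluster.

```python
from io import BytesIO

import numpy as np
from PIL import Image


def decode_image(raw):
    """Decode the raw bytes of an image file into a NumPy array."""
    return np.asarray(Image.open(BytesIO(raw)))


# Build a small in-memory PNG (4 pixels wide, 3 high) as sample input.
buffer = BytesIO()
Image.new("RGB", (4, 3), (255, 0, 0)).save(buffer, format="PNG")
raw_png = buffer.getvalue()

pixels = decode_image(raw_png)
print(pixels.shape, pixels.dtype)

# With a SparkContext `sc`, the same helper could be mapped over the files:
# arrays = sc.binaryFiles("path/to/images/").mapValues(decode_image)
```

Using mapValues instead of values().map keeps the file name next to each array, which may be useful when the classification results need to be matched back to the files.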

I really like the StringIO approach which is also documented [here](http://effbot.org/imagingbook/image.htm#tag-Image.open), however, turning the weird unicode into bytes is probably the tricky part. Both 'utf-8' and 'iso-8859-1' didn't work. Voted up though :) — B.Mr.W., Oct 15 '15 at 01:51

score 10 · Answer 2 · answered Sep 11 '18 at 11:02

In Spark 2.3 or later you can use built-in Spark tools to load image data into a Spark DataFrame. In Spark 2.3:

from pyspark.ml.image import ImageSchema

ImageSchema.readImages("path/to/images/")

In Spark 2.4 or later:

spark.read.format("image").load("path/to/images/")

This creates a DataFrame with the following schema:

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = false)
 |    |-- width: integer (nullable = false)
 |    |-- nChannels: integer (nullable = false)
 |    |-- mode: integer (nullable = false)
 |    |-- data: binary (nullable = false)

where image content is loaded into image.data.
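As a rough sketch of how image.data could be turned into an ndarray, the fields of the schema above are enough to reshape the buffer. This assumes an 8-bit unsigned image mode and the row-wise, OpenCV-style BGR channel order that the Spark documentation describes. The function name image_fields_to_array and the fake byte buffer are hypothetical stand-ins for a collected row.

```python
import numpy as np


def image_fields_to_array(height, width, n_channels, data):
    """Reshape the flat image.data buffer into (height, width, channels).

    Assumes one unsigned byte per channel value.
    """
    flat = np.frombuffer(data, dtype=np.uint8)
    return flat.reshape(height, width, n_channels)


# Stand-in for one row: a 2x2 image with 3 channels (12 bytes in total).
fake_data = bytearray(range(12))
bgr = image_fields_to_array(2, 2, 3, fake_data)

# Reverse the channel axis if RGB order is needed (assuming BGR input).
rgb = bgr[:, :, ::-1]
print(bgr.shape, rgb[0, 0])
```

With a real DataFrame, the arguments would come from a row's image.height, image.width, image.nChannels, and image.data fields.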

At the moment this functionality is experimental and lacks a supporting ecosystem, but it should improve in the future.
