Hi there I have a lot of images (lower millions) that I need to do classification on. I am using Spark and managed to read in all the images in the format of (filename1, content1), (filename2, content2) ...
into a big RDD.
images = sc.wholeTextFiles("hdfs:///user/myuser/images/image/00*")
However, I got really confused what to do with the unicode representation of the image.
Here is an example of one image/file:
(u'hdfs://NameService/user/myuser/images/image/00product.jpg', u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\ufffd\ufffd\x01\x1eExif\x00\x00II*\x00\x08\x00\x00\x00\x08\x00\x12\x01\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00\x1a\x01\x05\x00\x01\x00\x00\x00n\x00\x00\x00\x1b\x01\x05\x00\x01\x00\x00\x00v\x00\x00\x00(\x01\x03\x00\x01\x00\x00\x00\x02\x00\x00\x001\x01\x02\x00\x0b\x00\x00\x00~\x00\x00\x002\x01\x02\x00\x14\x00\x00\x00\ufffd\x00\x00\x00\x13\x02\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00i\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00GIMP 2.8.2\x00\x002013:07:29 10:41:35\x00\x07\x00\x00\ufffd\x07\x00\x04\x00\x00\x000220\ufffd\ufffd\x02\x00\x04\x00\x00\x00407\x00\x00\ufffd\x07\x00\x04\x00\x00\x000100\x01\ufffd\x03\x00\x01\x00\x00\x00\ufffd\ufffd\x00\x00\x02\ufffd\x04\x00\x01\x00\x00\x00\x04\x04\x00\x00\x03\ufffd\x04\x00\x01\x00\x00\x00X\x01\x00\x00\x05\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00\x02\x00\x01\x00\x02\x00\x04\x00\x00\x00R98\x00\x02\x00\x07\x00\x04\x00\x00\x000100\x00\x00\x00\x00\ufffd\ufffd\x04_http://ns.adobe.com/xap/1.0/\x00<?xpacket begin=\'\ufeff\' id=\'W5M0MpCehiHzreSzNTczkc9d\'?>\n<x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n <rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n <exif:Orientation>Top-left</exif:Orientation>\n <exif:XResolution>96</exif:XResolution>\n <exif:YResolution>96</exif:YResolution>\n <exif:ResolutionUnit>Inch</exif:ResolutionUnit>\n <exif:Software>ACD Systems Digital Imaging</exif:Software>\n <exif:DateTime>2013:07:29 10:37:00</exif:DateTime>\n <exif:YCbCrPositioning>Centered</exif:YCbCrPositioning>\n <exif:ExifVersion>Exif Version 2.2</exif:ExifVersion>\n <exif:SubsecTime>407</exif:SubsecTime>\n <exif:FlashPixVersion>FlashPix Version 1.0</exif:FlashPixVersion>\n <exif:ColorSpace>Uncalibrated</exif:ColorSpace>\n
Looking closer, there are actually some characters look like the metadata like
...
<x:xmpmeta xmlns:x=\'adobe:ns:meta/\'>\n<rdf:RDF xmlns:rdf=\'http://www.w3.org/1999/02/22-rdf-syntax-ns#\'>\n\n
<rdf:Description xmlns:exif=\'http://ns.adobe.com/exif/1.0/\'>\n
<exif:Orientation>Top-left</exif:Orientation>\n
<exif:XResolution>96</exif:XResolution>\n
<exif:YResolution>96</exif:YResolution>\n
...
My previous experience was using the package scipy and related functions like 'imread' ... and the input is usually a filename. Now I really got lost what does those unicode mean and what I can do to transform it into a format that I am familiar with.
Can anyone share with me how can I read in those unicode into a scipy image (ndarray)?