
I have some binary files that are images, and I want to iterate over them, distributing the pixels: each node of my cluster must get the RGB values of a different group of pixels than the other nodes, and store these RGB values in a Scala collection.

I am using SparkContext::binaryFiles, but I don't know how to make Apache Spark understand that I am working with images, that I want to iterate over their pixels in a distributed way, and that I want to get the RGB values. Could you help me do that, please?

JarsOfJam-Scheduler

2 Answers


Spark 2.3 added support for parsing images. You can read images and get their metadata and pixel data like this:

import org.apache.spark.ml.image.ImageSchema._
import org.apache.spark.sql.Row
import java.nio.file.Paths

val images = readImages("path/to/images")

images.foreach { outerRow =>
  // Each row wraps a single "image" struct column
  val row = outerRow.getAs[Row](0)
  val filename = Paths.get(getOrigin(row)).getFileName.toString
  val imageData = getData(row)   // flat array of pixel bytes
  val height = getHeight(row)
  val width = getWidth(row)

  println(s"${height}x${width}")
}

You can find some more information here.
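To get at the individual RGB values the question asks about: ImageSchema exposes the pixel data as a flat byte array in OpenCV's BGR channel order. A minimal sketch of the decoding step, assuming a 3-channel image (the helper name `toRgb` is made up for illustration):

```scala
// Sketch: decode the flat byte array returned by ImageSchema's getData
// into per-pixel (R, G, B) triples. Assumes a 3-channel image stored in
// OpenCV's BGR channel order, which is what Spark's ImageSchema uses.
def toRgb(data: Array[Byte], width: Int, height: Int,
          nChannels: Int = 3): Seq[(Int, Int, Int)] = {
  require(nChannels == 3, "this sketch only handles 3-channel images")
  (0 until width * height).map { i =>
    val base = i * nChannels
    val b = data(base) & 0xFF       // JVM bytes are signed: mask to 0..255
    val g = data(base + 1) & 0xFF
    val r = data(base + 2) & 0xFF
    (r, g, b)
  }
}
```

Applied per row inside the `foreach` above, this keeps the decoding on the executors.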

Simon

If you have the binary files, you just need to convert them into a matrix of integers (the RGB values). You can read how to convert images to an array of RGB values in Scala here:

http://otfried.org/scala/image.html
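Following the approach from that page, here is a hedged sketch using `javax.imageio` from the JDK (`bytesToRgbMatrix` is a hypothetical helper name). It decodes raw image bytes, such as the payload of one file from `binaryFiles`, into a matrix of packed `0xRRGGBB` integers:

```scala
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO

// Decode raw image bytes (e.g. one file's payload from binaryFiles)
// into a height x width matrix of packed 0xRRGGBB integers.
def bytesToRgbMatrix(bytes: Array[Byte]): Array[Array[Int]] = {
  val img = ImageIO.read(new ByteArrayInputStream(bytes))
  Array.tabulate(img.getHeight, img.getWidth) { (y, x) =>
    img.getRGB(x, y) & 0xFFFFFF   // drop the alpha byte, keep RGB
  }
}
```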

Here is an example done in Python :

Spark using PySpark read images

Sam Upra
  • But this conversion won't be distributed unfortunately – JarsOfJam-Scheduler Jun 03 '17 at 13:40
  • You can use a RowMatrix to store these values (it is distributed by default). https://spark.apache.org/docs/2.1.0/mllib-data-types.html – Sam Upra Jun 03 '17 at 13:48
  • Then the operations I do on the stored values will be distributed. But the conversion itself still won't be (conversion = traversing the image pixel by pixel to store the RGB values in that row matrix) – JarsOfJam-Scheduler Jun 03 '17 at 13:58
  • You can convert the binary files into RDDs: https://stackoverflow.com/questions/32602489/how-to-transfer-binary-file-into-rdd-in-spark. This way they're distributed, and you can convert them into RGB on whichever node they're found. – Sam Upra Jun 03 '17 at 14:06
  • Hey, in this line: `>>> image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata)))`, the image seems to be turned into an array only by the driver, not in a distributed way, no? – JarsOfJam-Scheduler Jun 03 '17 at 15:06
  • Yes, but your binary files will be, which makes the pixels distributed by default; the array is then created on the driver for each node. – Sam Upra Jun 03 '17 at 15:12
  • Mmmh. Tell me if I'm wrong, but in the Python example you gave me, the conversion is indeed distributed (`binaryFiles(...)` + `.values().map(...)`). But the reading access to the image is only done by the driver: `np.asarray(...)`. So is it impossible to distribute absolutely everything? – JarsOfJam-Scheduler Jun 03 '17 at 15:21
  • Yes, that's what I meant. I couldn't find documentation on whether the reading is distributed, so unless and until Spark supports this conversion for items in an RDD (which I couldn't find information on), it might be impossible to distribute everything. However, this could be useful to you: https://spark.apache.org/docs/2.1.0/mllib-guide.html. If you have time, please read through it; you might find what you need. (I personally haven't read the documentation completely.) – Sam Upra Jun 03 '17 at 15:24
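To summarise the thread above: once the files are in an RDD, the whole decode-to-RGB step can sit inside a `flatMap`, so it runs on whichever executor holds each file and the driver never touches pixel data. A sketch of the per-file function (`pixelTuples` is a hypothetical name, and the commented Spark wiring assumes an HDFS path):

```scala
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO

// Per-file decode step: raw bytes in, one (x, y, 0xRRGGBB) tuple per pixel out.
def pixelTuples(bytes: Array[Byte]): Seq[(Int, Int, Int)] = {
  val img = ImageIO.read(new ByteArrayInputStream(bytes))
  for {
    y <- 0 until img.getHeight
    x <- 0 until img.getWidth
  } yield (x, y, img.getRGB(x, y) & 0xFFFFFF)
}

// Hypothetical Spark wiring (path and context setup are assumptions):
// the decode runs inside flatMap, i.e. on the executors, not the driver.
//   sc.binaryFiles("hdfs://.../images").flatMap { case (name, stream) =>
//     pixelTuples(stream.toArray).map { case (x, y, rgb) => (name, x, y, rgb) }
//   }
```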