
I'm trying to process multiple folders, each with 810 separate TIFF files.

Folder structure:

(screenshot of the folder structure, not reproduced here)

Upon trying to create a dataframe for these, I'm running into the issue that the loaded byte arrays are empty, and I obviously need those for processing.

Dataframe creation:

from sys import argv
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName(name) \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()

# Load every file matching <base folder>/<subfolder>/<file>
file_rdd = spark.read.format('image').load(argv[1] + '/' + '*/*')

argv obviously contains the base folder as the first parameter. When debugging (via the debugger or by printing) I noticed that my dataframe is a bunch of rows that only have the origin set; all the other values are either -1 or empty.

(screenshot of the dataframe rows, not reproduced here)

I mainly need the byte array to be filled in, as well as the origin. Although, when observing the memory used on my system, there is an obvious spike, indicating that it is definitely loading something.

Am I doing something wrong or unsupported?

MrKickkiller

1 Answer


The -1s mean that the corresponding images are invalid, i.e. Spark could not decode them. If you add the dropInvalid option to the reader and set it to True, those rows will not be present at all.
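As a sketch (reusing the session and glob pattern from the question; running this requires a working Spark installation):

```python
from sys import argv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiff-loader").getOrCreate()

# dropInvalid=True drops rows whose image could not be decoded,
# instead of keeping them with an empty data field and width/height = -1.
df = spark.read.format("image") \
    .option("dropInvalid", True) \
    .load(argv[1] + "/*/*")
```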

Spark uses Java's ImageIO library to read images. ImageIO makes use of plug-ins to support different image formats. Java versions up to 8 only ship with plug-ins for JPEG, PNG, BMP, WBMP, and GIF; Java 9 adds a standard plug-in for TIFF. Since Spark officially supports Java 8 only, your option is to use a third-party TIFF plug-in for ImageIO, for example this one provided by a fellow Stack Overflow user.

To use the aforementioned plug-in, add something like this to the Spark session configuration:

.config("spark.jars.packages", "com.twelvemonkeys.imageio:imageio-tiff:3.5,com.twelvemonkeys.imageio:imageio-core:3.5") \

You can track the package versions in the Maven Index.
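Putting the pieces together with the session from the question, something like the following should work (the 3.5 coordinates match the answer; verify the latest versions in the Maven Index before using them):

```python
from sys import argv
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("tiff-loader") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.jars.packages",
            "com.twelvemonkeys.imageio:imageio-tiff:3.5,"
            "com.twelvemonkeys.imageio:imageio-core:3.5") \
    .getOrCreate()

# With the TwelveMonkeys plug-in on the classpath, ImageIO can decode
# TIFFs, so the image data source fills in data/width/height.
df = spark.read.format("image") \
    .option("dropInvalid", True) \
    .load(argv[1] + "/*/*")
```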

Hristo Iliev
  • Sadly, if I were to drop the invalid ones, I would end up with none left. Your second note about the libraries does make sense. Sadly, my Spark project is in Python. Do you think adding --jars path/to/imageio/lib to the spark-submit call would autoload the jar into the JVM and therefore give me the ability to read a TIF? – MrKickkiller Apr 21 '20 at 22:40
  • 1
    You can directly pass Maven package coordinates in `spark.jars.packages`. Adding the package or its JARs should suffice - ImageIO plug-ins are resolved automatically. – Hristo Iliev Apr 21 '20 at 22:43
  • First off, I'm impressed that pyspark makes it that simple to add Java dependencies. However, upon adding your suggested line, the class javax.imageio.ImageIO now could not be initialized. I noticed that it did not download the imageio core thingy from Maven. Could that be it? – MrKickkiller Apr 21 '20 at 22:50
  • 1
    Apparently, it also needs the `imageio-core` package. I'm updating the answer right now. – Hristo Iliev Apr 21 '20 at 23:08
  • My Maven-foo is weak, but in the POM file, the `imageio-core` dependency is present twice, and the second time it is marked as a test dependency. Perhaps the resolver skips it because of this, so it needs to be added manually. – Hristo Iliev Apr 21 '20 at 23:20
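For completeness, the coordinates discussed in the comments can also be passed at submit time instead of in the session config. A sketch (the script name and base-folder path are hypothetical placeholders):

```shell
spark-submit \
  --packages com.twelvemonkeys.imageio:imageio-tiff:3.5,com.twelvemonkeys.imageio:imageio-core:3.5 \
  process_tiffs.py /path/to/base_folder
```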