
Hi, I am trying to read image files from the local file system and store them in HDFS using Spark and Scala.

Here is my code:

val streams = spark.sparkContext.wholeTextFiles("file:///home/jeffi/input/Images_Test/")
val op = streams.toDF()  //op: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
op.printSchema() //root |-- _1: string (nullable = true) |-- _2: string (nullable = true)

When I tried to write the op DataFrame to HDFS, I got the following exception:

 op.write.text("/home/cisadmin/image_op")

org.apache.spark.sql.AnalysisException: Text data source supports only a single column, and you have 2 columns.;

I tried various calls on the write method, such as op.write and op.write.wholeTextFiles(""), but nothing works for me. Any help would be appreciated.

Teju Priya

1 Answer


Regarding your error: if you check the documentation for the text method, it says:

Saves the content of the [[DataFrame]] in a text file at the specified path.
The DataFrame must have only one column that is of string type.
Each row becomes a new line in the output file.

But in your case op has two columns, so you can either save the DataFrame as CSV, or convert it to an RDD and then save it as a text file.
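For example, either of the following approaches avoids the single-column restriction. This is a minimal sketch; the output paths are placeholders for your own HDFS locations:

```scala
// Option 1: save both columns as CSV (path is an example).
op.write.csv("hdfs:///user/jeffi/image_op_csv")

// Option 2: convert to an RDD of strings and save as text;
// here each row is joined with a tab separator.
op.rdd.map(row => row.mkString("\t")).saveAsTextFile("hdfs:///user/jeffi/image_op_text")

// Option 3: select a single string column, then text works.
op.select("_1").write.text("hdfs:///user/jeffi/image_paths")
```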

But as Ramesh Maharjan mentioned, you should not use text APIs to read image files in the first place. Images are binary data, and wholeTextFiles decodes file contents as strings, which will corrupt them.
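A better fit is sparkContext.binaryFiles, which returns an RDD[(String, PortableDataStream)] and does not attempt any text decoding. The sketch below is one possible way to copy the raw bytes into HDFS; the input and output paths are taken from your question, and the copy logic is an illustrative example, not the only approach:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Read each image as (file path, binary stream) -- no text decoding.
val images = spark.sparkContext.binaryFiles("file:///home/jeffi/input/Images_Test/")

// Write each image's raw bytes into HDFS, preserving the file name.
images.foreach { case (path, stream) =>
  val fs = FileSystem.get(new Configuration())
  val name = path.substring(path.lastIndexOf('/') + 1)
  val out = fs.create(new Path("/home/cisadmin/image_op/" + name))
  try out.write(stream.toArray()) finally out.close()
}
```

Note that the Hadoop FileSystem handle is created inside the closure so that it is instantiated on the executors rather than serialized from the driver.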

vindev