
I am working on a scenario where I need to read Avro container files from HDFS and analyze them using Spark.

Input Files Directory: hdfs:///user/learner/20151223/*.lzo

Note: the input Avro files are LZO-compressed.

val df = sqlContext.read.avro("/user/learner/20151223/*.lzo");

When I run the above command, it throws an error:

java.io.FileNotFoundException: No avro files present at file:/user/learner/20151223/*.lzo
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:225)

This makes sense, because the read.avro() method expects files with the .avro extension as input. So I extracted and renamed the input .lzo files to .avro, and after that I was able to read the Avro data properly.

Is there any way to read LZO-compressed Avro files in Spark directly?
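For what it's worth, the extension check itself can be relaxed without renaming files: spark-avro honors the Hadoop property avro.mapred.ignore.inputs.without.extension. This only removes the .avro-extension filter, though; the bytes still have to be a readable Avro container, so it will not by itself decode externally lzop-compressed files. A hedged configuration sketch, using the same path as above:

```scala
// Sketch: tell spark-avro not to skip files that lack the .avro extension.
// Caveat: this only relaxes the extension filter; each file must still be a
// valid Avro container, so externally LZO-compressed files will still fail.
import com.databricks.spark.avro._

sc.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
val df = sqlContext.read.avro("/user/learner/20151223/*.lzo")
```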

Update: a solution worked, but!

I have found a way to work around this issue. I created a shell wrapper that decompresses each .lzo file into .avro format as follows:

hadoop fs -text <file_path>*.lzo | hadoop fs -put - <file_path>.avro

I am successful in decompressing the LZO files, but the problem is that I have at least 5000 files in compressed format. Decompressing and converting them one by one takes over an hour to run this job.

Is there any way to do this decompression in bulk?
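One way to speed this up, assuming the per-file text | put pipeline above already works, is to run several of those pipelines concurrently with xargs -P instead of sequentially. A sketch, not tested against a real cluster: the helper name bulk_decompress and the job count are my own, and the HDFS directory is the one from the question.

```shell
# bulk_decompress: read HDFS .lzo paths, one per line, on stdin and run the
# same "hadoop fs -text | hadoop fs -put" pipeline for each, keeping up to $2
# pipelines in flight at once (xargs -P). $1 is the hadoop binary to invoke.
bulk_decompress() {
  hadoop_bin="${1:-hadoop}"
  jobs="${2:-8}"
  # each input path becomes $0 of the inner sh; foo.lzo is written as foo.avro
  xargs -n 1 -P "$jobs" sh -c \
    "$hadoop_bin fs -text \"\$0\" | $hadoop_bin fs -put - \"\${0%.lzo}.avro\""
}

# usage sketch: list the compressed files, then decompress 8 at a time
# hadoop fs -ls /user/learner/20151223 | awk '/\.lzo$/ {print $NF}' | bulk_decompress hadoop 8
```

This keeps the decompression on the edge node; pushing the work into a Spark job over the list of file paths would distribute it further, at the cost of more code.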

Thanks again !

Govind
