
I have a JSON file (2-3 GB) stored in HDFS. My files look like this:

{ "DateTime" : 24-08-2015T00:00:00, "Cost":53.09,"UID":9,"Channel":"some Channel"}
{ "DateTime" : 25-08-2015T00:00:00, "Cost":54.09,"UID":8,"Channel":"some Channel2"}
{ "DateTime" : 24-08-2015T00:00:00, "Cost":56.09,"UID":7,"Channel":"some Channel3"}

I am trying to write a MapReduce job to convert these JSON files into sequence files and then read the JSON objects back. Since I need fast execution, parsing every record with Gson and converting it into a Java object will take too long. I googled and found that JAQL can do this, but I couldn't find any Java MR code for it, nor any Maven jars for JAQL, and I can't install it explicitly on my server. Is there any way to achieve this using Java code?

sangita

1 Answer


I'd suggest Tika.
Description of this project: Apache Tika integration with Jaql using MapReduce for Hadoop.

This project helps get around the inefficiency of processing many small files in Hadoop with Jaql. Moreover, it allows processing and analysis of binary documents in Hadoop by integrating Apache Tika into Jaql, which in turn spawns a MapReduce job. Please check the samples.
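If Jaql and Tika can't be installed on the cluster, the same conversion can also be done with plain Hadoop MapReduce. The following is only a minimal sketch (Hadoop 2.x `mapreduce` API; the class name, input/output paths, and job name are placeholders, not anything from the question): a map-only job that copies each JSON line into a block-compressed SequenceFile keyed by the line's byte offset, with no JSON parsing at all.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class JsonToSequenceFile {

        // Identity mapper: the byte offset of the line is the key and the raw
        // JSON line is the value, so the conversion is basically a copy.
        public static class JsonLineMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "json-to-sequencefile");
            job.setJarByClass(JsonToSequenceFile.class);

            job.setMapperClass(JsonLineMapper.class);
            job.setNumReduceTasks(0);                      // map-only job

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Block compression keeps the SequenceFile compact and splittable
            // (uses the default codec unless one is configured).
            SequenceFileOutputFormat.setCompressOutput(job, true);
            SequenceFileOutputFormat.setOutputCompressionType(
                    job, SequenceFile.CompressionType.BLOCK);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS dir with JSON files
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // SequenceFile output dir

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Downstream jobs can then read the records with SequenceFileInputFormat (or SequenceFile.Reader directly) and deserialize the JSON only for the records they actually need. Whether this ends up faster than reading the text files directly depends mostly on the compression, so it's worth measuring on a sample first.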

Ram Ghadiyaram