Currently I'm loading a pretty massive amount of data into our Redshift clusters from S3 (roughly 10k rows per second).
This has become a problem when trying to run any queries on the data: even rolling up a couple of hours' worth of data runs into out-of-memory errors.
What I'd like to do is run a MapReduce job on the data and load only the aggregates into Redshift. I know this is supposed to be a fairly easy task, but I'm completely new to Hadoop, and I'm a bit stuck on one of the first steps.
- Set up an EMR cluster (done)
- Load the data into HDFS (I think this is what I'm supposed to do)
Currently all the data is landing in S3 as gzipped JSON files (which makes it easy to load into Redshift). Do I have to change the file format to get it into Hadoop? Each S3 file looks something like this:
{
"timestamp":"2015-06-10T11:54:34.345Z",
"key":"someguid",
"device": { "family" : "iOS", "versions" : "v8.4" }
}
{
"timestamp":"2015-06-11T15:56:44.385Z",
"key":"some second key",
"device": { "family" : "Android", "versions" : "v2.2" }
}
Each JSON object is one record/row. (Note that the objects simply follow one another; in the real files there is no whitespace, no commas separating the objects, or anything like that.)
It wouldn't be a huge deal for me to change the format of these files to something that just works, but I'm not sure what that format would be (plain CSV? Can I still gzip them?).
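For what it's worth, here's a rough sketch (plain Python, file names made up) of how I imagine splitting the back-to-back objects into one-object-per-line JSON, if that turns out to be the format the Hadoop tools prefer:

    import gzip
    import json

    def concatenated_to_ndjson(src_path, dst_path):
        """Split back-to-back JSON objects into newline-delimited JSON."""
        decoder = json.JSONDecoder()
        with gzip.open(src_path, "rt", encoding="utf-8") as src, \
             gzip.open(dst_path, "wt", encoding="utf-8") as dst:
            buf = src.read()  # assuming each file is small enough to read whole
            pos = 0
            while pos < len(buf):
                # parse one object starting at pos, get the offset where it ends
                obj, end = decoder.raw_decode(buf, pos)
                dst.write(json.dumps(obj) + "\n")  # write it back on its own line
                pos = end
                # skip any stray whitespace between objects
                while pos < len(buf) and buf[pos].isspace():
                    pos += 1

    # e.g. concatenated_to_ndjson("events-2015-06-10.json.gz", "events-2015-06-10.ndjson.gz")

No idea if this conversion is even necessary, but it's cheap to do if one-record-per-line is what's expected.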
So my questions are:
- Is it possible to work with these files as-is? And if so, would I have fewer headaches by just changing them anyway?
- Once my files are in the right shape and imported, what is the easiest way to roll this data up by hour and write the result back to S3 so I can load it into Redshift? (See the sketch below for the kind of rollup I mean.) Ideally this job would run every hour, so my Redshift tables are updated with the previous hour's data. What technologies should I be reading about to accomplish this: Hive? Impala? Pig? Again, I'm just looking for the simple solution.
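To be concrete about what "rolling up" means here, this is the logic I want, written as plain Python against one-record-per-line files like the ones above (the per-hour count by device family is just an example aggregate; everything except the timestamp and device.family fields is made up):

    import gzip
    import json
    from collections import Counter

    def hourly_rollup(ndjson_gz_path):
        """Count records per (hour, device family) -- the kind of per-hour aggregate I want."""
        counts = Counter()
        with gzip.open(ndjson_gz_path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                hour = record["timestamp"][:13]        # "2015-06-10T11:54:34.345Z" -> "2015-06-10T11"
                family = record["device"]["family"]
                counts[(hour, family)] += 1
        return counts

    # e.g.
    # for (hour, family), n in hourly_rollup("events.ndjson.gz").items():
    #     print(hour, family, n)

I just need the equivalent of that running on EMR over all the files for an hour, with the output landing back in S3 as something Redshift can COPY.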