Currently I'm loading a pretty massive amount of data into our Redshift clusters from S3 (roughly 10k rows per second).
This has become a problem when trying to run any queries on the data: even rolling up a couple of hours' worth of data runs into out-of-memory errors.
What I'd like to do is run a MapReduce job on the data and load only the aggregates into Redshift. I know this is supposed to be a fairly easy task, but I'm completely new to Hadoop, and I'm a bit stuck on one of the first steps.
- Set up an EMR cluster (done)
- Load the data into HDFS (I think this is what I'm supposed to do)
Currently all the data is landing in S3 as gzipped JSON files (which makes it easy to load into Redshift). Do I have to change the file format to get it into Hadoop? Each S3 file looks something like this:
{
"timestamp":"2015-06-10T11:54:34.345Z",
"key":"someguid",
"device": { "family" : "iOS", "versions" : "v8.4" }
}
{
"timestamp":"2015-06-11T15:56:44.385Z",
"key":"some second key",
"device": { "family" : "Android", "versions" : "v2.2" }
}
Each JSON object is one record/row. (Note that the objects simply follow one another; in the real files there is no whitespace, no commas separating the objects, or anything like that.)
It wouldn't be a huge deal for me to change the format of these files to something that just works, but I'm not sure what that format would be (plain CSV? Can I still gzip them?).
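For what it's worth, here's a rough sketch (plain Python, file names made up) of how I imagine splitting the back-to-back objects into one-object-per-line JSON, if that turns out to be the format the Hadoop tools prefer:

    import gzip
    import json

    def concatenated_to_ndjson(src_path, dst_path):
        """Split back-to-back JSON objects into newline-delimited JSON."""
        decoder = json.JSONDecoder()
        with gzip.open(src_path, "rt", encoding="utf-8") as src, \
             gzip.open(dst_path, "wt", encoding="utf-8") as dst:
            buf = src.read()  # assuming each file is small enough to read whole
            pos = 0
            while pos < len(buf):
                # parse one object starting at pos, get the offset where it ends
                obj, end = decoder.raw_decode(buf, pos)
                dst.write(json.dumps(obj) + "\n")  # write it back on its own line
                pos = end
                # skip any stray whitespace between objects
                while pos < len(buf) and buf[pos].isspace():
                    pos += 1

    # e.g. concatenated_to_ndjson("events-2015-06-10.json.gz", "events-2015-06-10.ndjson.gz")

No idea if this conversion is even necessary, but it's cheap to do if one-record-per-line is what's expected.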
So my questions are:
- Is it possible to work with these files as-is? And if so, would I have fewer headaches by just changing them anyway?
- Once my files are in the right shape and imported, what is the easiest way to roll this data up by hour and write the result back to S3 so I can load it into Redshift? (See the sketch below for the kind of rollup I mean.) Ideally this job would run every hour, so my Redshift tables are updated with the previous hour's data. What technologies should I be reading about to accomplish this: Hive? Impala? Pig? Again, I'm just looking for the simple solution.
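To be concrete about what "rolling up" means here, this is the logic I want, written as plain Python against one-record-per-line files like the ones above (the per-hour count by device family is just an example aggregate; everything except the timestamp and device.family fields is made up):

    import gzip
    import json
    from collections import Counter

    def hourly_rollup(ndjson_gz_path):
        """Count records per (hour, device family) -- the kind of per-hour aggregate I want."""
        counts = Counter()
        with gzip.open(ndjson_gz_path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                hour = record["timestamp"][:13]        # "2015-06-10T11:54:34.345Z" -> "2015-06-10T11"
                family = record["device"]["family"]
                counts[(hour, family)] += 1
        return counts

    # e.g.
    # for (hour, family), n in hourly_rollup("events.ndjson.gz").items():
    #     print(hour, family, n)

I just need the equivalent of that running on EMR over all the files for an hour, with the output landing back in S3 as something Redshift can COPY.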