0

I'm running a hvie script on EMR that's pulling data out of s3 keys. I can get all the data and put it in a table just fine. The problem is, some of the data I need is in the key name. How do I get the key name from within hive and put that into the hive table?

Jeff
  • 6,932
  • 7
  • 42
  • 72

1 Answers1

1

I faced similar problem recently. From what I researched, it depends. You can get the data out of the "directory" part but not the "filename" part of s3 keys.

You can use partition if s3 keys are formatted properly. partition can be queried the same way as columns. here is a link with some examples: Loading data with Hive, S3, EMR, and Recover Partitions

You can also specify the partitions yourself if s3 files are already grouped properly. For example I needed the date information so my script looked like this:

create external table Example(Id string, PostalCode string, State string)
    partitioned by (year int, month int, day int) 
    row format delimited fields terminated by ',' 
    tblproperties ("skip.header.line.count"="1");

alter table Example add partition(year=2014,month=8,day=1) location 's3n://{BuckeyName}/myExampledata/2014/08/01/';

alter table Example add partition(year=2014,month=8,day=2) location 's3n://{BuckeyName}/myExampledata/2014/08/02/';
...keep going

The partition data must be part of the "directory name" and not the "filename" because Hive loads data from a directory.

If you need to read some text out of the file name, I think you have to create custom program to rename the objects to so that the text you need is in the "directory name".

Good luck!

Community
  • 1
  • 1
Tzu
  • 235
  • 1
  • 9