2

Our s3 buckets generally have a number of sub-directories, so that the path to a bucket is something like s3:top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data

We're looking into Athena to be able to query against data on that /actual-data level, but within the org-tenant-company-id, so that would have to be passed as some kind of parameter.

Or would that org-tenant-company-id be a partition?

is it possible to create an athena table that queries against this structure? And what would the s3 location be on the create table wizard? I tried it with s3:top-level-function-group/more-specific-folder/ but when it ran, I think it said something like '0 Kb data read'.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
user26270
  • 6,904
  • 13
  • 62
  • 94
  • Your plan sounds viable to me. Can you share the create table statement that didn't work? – James Feb 09 '17 at 17:07
  • 1
    The initial create table statement appeared to work; at least it didn't error; but it didn't load any data either; I tried it again and I think the problem was a column that didn't exist; now I'd like to partition it along that org-tenant-company-id sub-directory, but that's probably a separate question, unless you know it – user26270 Feb 09 '17 at 19:39

2 Answers2

3

You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:

CREATE EXTERNAL TABLE mydb.mytable (
    id int,
    stuff string,
    ...
)
PARTITIONED BY (
    orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';

After creating the table, you can then load individual partitions:

ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';

Result rows will contain the partition fields like orgtenantcompanyid.

James
  • 11,721
  • 2
  • 35
  • 41
  • Thanks, but I think we need the partitioning to be done automatically. Wouldn't want to have to maintain the partitions manually. it looks like to do that we'd have to change our subdirectories to be in a 'key=value' format for them to be recognized as partitions automatically – user26270 Feb 10 '17 at 15:55
  • Won't you still need to call `MSCK REPAIR TABLE mydb.mytable`? [Scenario 1: Data already partitioned and stored on S3 in hive format](http://docs.aws.amazon.com/athena/latest/ug/partitions.html#scenario-1-data-already-partitioned-and-stored-on-s3-in-hive-format) – James Feb 10 '17 at 17:46
  • I don't know, it's not clear from there whether you need to keep calling that command as further partitions are created, or whether it's a one-time thing. – user26270 Feb 10 '17 at 18:25
  • I had to keep calling it. But it does auto-discover available partitions, which might be nice if you are adding a number of them. – James Feb 10 '17 at 18:56
1

Yes, it is possible to create tables that only use contents of a specific subdirectory.

It's normal that after creating your table you see 0kb read. That's because no data is read when you CREATE a table.

To check whether you can acutally query the data do something like:

SELECT * FROM <table_name> LIMIT 10

Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.

And yes, it's possible to use the path structure to create partitions. However, not automatically if it's not in the right format /key=value/. You can use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables

Cornelius Roemer
  • 3,772
  • 1
  • 24
  • 55