Let's say you want to process 4 months of data with CTAS queries, but you need to partition it by day. If you do it in a single CTAS query you will end up with roughly 4 x 30 = 120 partitions, so the query will fail, as you mentioned, due to the AWS limit of 100 partitions per CTAS query.
Instead, you can process your data one month at a time, so you are guaranteed to have fewer than 31 partitions per query. However, the result of each CTAS query must have a unique external location on S3, i.e. if you want to store the results of multiple CTAS queries under s3://bucket-name/data-root, you would need to extend this path for each query in external_location under the WITH clause. The obvious choice in your case would be the year and month, for example:
s3://bucket-name/data-root
├──2019-01 <-- external_location='s3://bucket-name/data-root/2019-01'
| └── month=01
| ├── day=01
| | ...
| └── day=31
├──2019-02 <-- external_location='s3://bucket-name/data-root/2019-02'
| └── month=02
| ├── day=01
| | ...
| └── day=28
...
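To make this more concrete, here is a minimal sketch of how the month-by-month CTAS queries could be submitted with boto3. All database, table, bucket and column names below are placeholders, and it assumes that month and day already exist as the last two columns of the source table (a CTAS requirement for partitioned_by), so adapt it to your own schema:

import boto3

athena = boto3.client("athena")

# Placeholder names -- adjust to your own database, table and bucket
DATABASE = "my_database"
SOURCE_TABLE = "raw_events"
BUCKET = "bucket-name"

months = ["2019-01", "2019-02", "2019-03", "2019-04"]

for month in months:
    year, mm = month.split("-")
    ctas = f"""
        CREATE TABLE {SOURCE_TABLE}_{year}_{mm}
        WITH (
            format = 'PARQUET',
            external_location = 's3://{BUCKET}/data-root/{month}',
            partitioned_by = ARRAY['month', 'day']
        ) AS
        SELECT *
        FROM {SOURCE_TABLE}
        WHERE month = '{mm}'  -- placeholder filter; adapt to your schema
    """
    # Each query writes at most 31 day-partitions, staying under the 100-partition limit
    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-query-results/"},
    )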
However, you now end up with 4 different tables. So you either need to query different tables, or you have to do some postprocessing. Essentially, you have two options:
1. Move all new files into a common place with AWS CLI high-level commands, followed by MSCK REPAIR TABLE, since the output "directory" structure adheres to the Hive partitioning naming convention (see the sketch after the example layout below). For example, from
s3://bucket-name/data-staging-area
├──2019-01 <-- external_location='s3://bucket-name/data-staging-area/2019-01'
| └── month=01
| ├── day=01
| | ...
you would copy into
s3://bucket-name/data-root
├── month=01
| ├── day=01
| | ...
| └── day=31
├── month=02
| ├── day=01
| | ...
| └── day=28
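Here is a minimal sketch of what that copy-and-repair step could look like. For consistency with the other snippets it uses boto3 instead of the AWS CLI, and all bucket, prefix, database and table names are placeholders:

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Placeholder names -- adjust to your own bucket, prefixes and table
BUCKET = "bucket-name"
STAGING_PREFIX = "data-staging-area/2019-01/"
ROOT_PREFIX = "data-root/"

# Copy every object from the staging prefix into the root prefix, dropping the
# leading "data-staging-area/2019-01/" part of the key so that only
# "month=01/day=01/..." remains under data-root.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=STAGING_PREFIX):
    for obj in page.get("Contents", []):
        relative_key = obj["Key"][len(STAGING_PREFIX):]
        s3.copy_object(
            Bucket=BUCKET,
            Key=ROOT_PREFIX + relative_key,
            CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
        )

# Let Athena discover the new month=.../day=... partitions
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE root_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-query-results/"},
)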
2. Manipulate the AWS Glue Data Catalog. This is a little trickier, but the main idea is that you define a root table whose location points to s3://bucket-name/data-root. Then, after executing each CTAS query, you copy the meta-information about partitions from the created "staging" table into the root table. This step can be done with the AWS Glue API, for example via the boto3 library for Python; in particular, you would use the get_partitions() and batch_create_partition() methods.
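A rough sketch of that partition-copying step might look as follows. The database and table names are placeholders, and only the fields accepted by batch_create_partition are carried over from the staging table:

import boto3

glue = boto3.client("glue")

# Placeholder names -- adjust to your own Glue database and tables
DATABASE = "my_database"
STAGING_TABLE = "raw_events_2019_01"   # table created by the CTAS query
ROOT_TABLE = "root_table"              # table pointing to s3://bucket-name/data-root

# Fields accepted by PartitionInput; everything else returned by
# get_partitions() (CreationTime, TableName, etc.) must be dropped.
ALLOWED_KEYS = {"Values", "StorageDescriptor", "Parameters",
                "LastAccessTime", "LastAnalyzedTime"}

partition_inputs = []
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName=DATABASE, TableName=STAGING_TABLE):
    for partition in page["Partitions"]:
        partition_inputs.append(
            {key: value for key, value in partition.items() if key in ALLOWED_KEYS}
        )

# batch_create_partition accepts at most 100 partitions per call
for start in range(0, len(partition_inputs), 100):
    glue.batch_create_partition(
        DatabaseName=DATABASE,
        TableName=ROOT_TABLE,
        PartitionInputList=partition_inputs[start:start + 100],
    )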
Regardless of which approach you choose, you will need some sort of job scheduling software, especially since your data is not just historical. I would suggest using Apache Airflow for that. It can be seen as an alternative to a combination of Lambda and Step Functions, and it is totally free. There are plenty of blog posts and documentation that can help you get started. For example:
- Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
- Complete guide to installation of Airflow, link 1 and link 2
You can even set up an integration with Slack for sending notifications when your queries terminate either in a success or failure state.
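As a rough illustration of how such a schedule could be wired up (assuming Airflow 2.x; the DAG id, schedule and callable are made up for this sketch, and run_monthly_ctas would wrap the boto3 logic shown above):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_monthly_ctas(**context):
    # Placeholder: wrap the boto3 CTAS / S3 copy / Glue calls shown above here.
    # The logical date (context["ds"]) tells you which month to process.
    print("Processing data for", context["ds"])


with DAG(
    dag_id="athena_monthly_ctas",      # made-up DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@monthly",      # one run per month of data
    catchup=True,                      # backfill the historical months
) as dag:
    PythonOperator(
        task_id="run_monthly_ctas",
        python_callable=run_monthly_ctas,
    )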
Things to keep in mind:
In general, you don't have explicit control of how many files will be created as a result of a CTAS query, since Athena is a distributed system. On the other hand, you don't want to have a lot of small files. So you can try to use "this workaround", which relies on the bucketed_by and bucket_count fields within the WITH clause:
CREATE TABLE new_table
WITH (
    ...
    bucketed_by=ARRAY['some_column_from_select'],
    bucket_count=1
) AS (
    -- Here goes your normal query
    SELECT
        *
    FROM
        old_table
);
Alternatively, you can reduce the number of partitions, i.e. stop at the month level.