Duplicate Table in AWS Glue using AWS Athena

Question

I have a table in AWS Glue which uses an S3 bucket for it's data location. I want to execute an Athena query on that existing table and use the query results to create a new Glue table.

I have tried creating a new Glue table, pointing it to a new location in S3, and piping the Athena query results to that S3 location. This almost accomplishes what I want, but

a .csv.metadata file is put in this location along with the actual .csv output (which is read by the Glue table as it reads all files in the specified s3 location).
The csv file places double quotes around each field, which ruins any fieldSchema defined in the Glue Table that uses numbers

These services are all designed to work together, so there must be a proper way to accomplish this. Any advice would be much appreciated :)

Ilya Kisil · Accepted Answer · 2019-10-24T15:55:05.723

The way to do that is by using CTAS query statements.

A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT statement from another query. Athena stores data files created by the CTAS statement in a specified location in Amazon S3.

For example:

CREATE TABLE new_table
WITH (
     external_location = 's3://my_athena_results/new_table_files/'
) AS (
    -- Here goes your normal query 
    SELECT 
        *
    FROM 
        old_table;
)

There are some limitations though. However, for your case the most important are:

The destination location for storing CTAS query results in Amazon S3 must be empty.
The same applies to the name of new table, i.e. it shouldn't exist in AWS Glue Data Catalog.

In general, you don't have explicit control of how many files will be created as a result of CTAS query, since Athena is a distributed system. However, can try this to use "this workaround" which uses bucketed_by and bucket_count fields within WITH clause

CREATE TABLE new_table
WITH (
    external_location = 's3://my_athena_results/new_table_files/',
    bucketed_by=ARRAY['some_column_from_select'],
    bucket_count=1
) AS (
    -- Here goes your normal query 
    SELECT 
        *
    FROM 
        old_table;
)

Apart from creating a new files and defining a table associated with you can also convert your data to a different file formats, e.g. Parquet, JSON etc.

score 1 · Answer 2 · answered Mar 07 '19 at 18:11

1

I guess you have to change ur ser-de. If you are querying csv data either opencsvserde or lazysimple serde should work for you.

answered Mar 07 '19 at 18:11

Nelson Sequiera

1,294
13
15

Duplicate Table in AWS Glue using AWS Athena

2 Answers2