4

I am baffled: I cannot figure out how to export a sucessfully run CREATE TABLE statement to a single CSV.

The query "saves" the result of my Create Table command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files

CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS 
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa

How can I save this table to S3, as a single file, CSV format, without having to download and re-upload it?

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
user3490715
  • 51
  • 1
  • 1
  • 5

2 Answers2

10

If you want a result of CTAS query statement being written into a single file, then you would need to use bucketing by one of the columns you have in your resulting table. In order to get resulting files in csv format, you would need to specify tables' format and field delimiter properties.

CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid 
WITH (
      format = 'TEXTFILE',
      field_delimiter = ',', 
      external_location = 's3://my_athena_results/ctas_query_result_bucketed/', 
      bucketed_by = ARRAY['__SOME_COLUMN__'], 
      bucket_count = 1) 
AS (
    SELECT *
    FROM targetsmart_idl_data_pa_mi_deduped_aaid
    UNION ALL
    SELECT *
    FROM targetsmart_idl_data_pa_mi_deduped_idfa
);

Athena is a distributed system, and it will scale the execution on your query by some unobservable mechanism. Note, that even explicitly specifying a bucket size of one, might still get multiple files [1].

See Athena documentation for more information on its syntax and what can be specified within WITH directive. Also, don't forget about considerations and limitations for CTAS Queries, e.g. the external_location for storing CTAS query results in Amazon S3 must be empty etc.

Update 2019-08-13

  • Apparently, the result of CTAS statements are compressed with GZIP algorithm by default. I couldn't find in documentation how to change this behavior. So, all you would need is to uncompress it after you had downloaded it locally. NOTE: uncompressed files won't have .csv file extension, but you still will be able to open them with text editors.

Update 2019-08-14

  • You wont' be able to preserve column names inside files if you save them in csv format. Instead, they would be specified in AWS Glue meta-data catalog, together with other information about a newly created table.

  • If you want to preserve column names in the output files after executing CTAS queries, then you should consider file formats which inherently do that, e.g. JSON, Parquet etc. You can do that by using format property within WITH clause. Choice of file format really depends on a use case and size of data. Go with JSON if your files are relatively small and you want to download and be able to read their content virtually from anywhere. If files are big and you are planning to keep them on S3 and query with Athena, then go with Parquet.

Ilya Kisil
  • 2,490
  • 2
  • 17
  • 31
  • Thank you - this is amazing! One last question: I am struggling to preserve column names? Thanks! – user3490715 Aug 14 '19 at 03:24
  • Not sure about your use case, but `CTAS` queries are normally used to improve query performance by doing data conversion, selecting a subset of data or other manipulations with dataset so not to do it all the time. Why do you want to create a new table? Would a regular query be enough? Nevertheless, see updated answer. – Ilya Kisil Aug 14 '19 at 10:14
  • _"even explicitly specifying a bucket size of one, might still get multiple files"_ -> Is this still the case? In my experience, setting `bucket_count = 1` gets you exactly one file. And the documentation clearly states [bucket_count: number of files that you want](https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/) – Henrik Koberg Apr 07 '22 at 08:03
1

Athena stores query results in Amazon S3. A results file stored automatically in a CSV format (*.csv) .So results can be exported into a csv file without CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html)

Execute athena query using StartQueryExecution API and results .csv can be found at the output location specified in api call. (https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html)

P. Phalak
  • 457
  • 1
  • 4
  • 11