33

Is it possible to delete data stored in S3 through an Athena query? I have some rows I have to delete from a couple of tables (they point to separate buckets in S3).

I couldn't find a way to do it in the Athena User Guide: https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf and DELETE FROM isn't supported, but I'm wondering if there is an easier way than trying to find the files in S3 and deleting them.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
Guillermo Mirandes
  • 435
  • 1
  • 5
  • 8

7 Answers7

41

You can leverage Athena to find out all the files that you want to delete and then delete them separately. There is a special variable "$path".

Select "$path" from <table> where <condition to get row of files to delete>

To automate this, you can have iterator on Athena results and then get filename and delete them from S3.

Dhaval
  • 1,046
  • 6
  • 10
  • 1
    This is cool, thanks for sharing, but I can't delete the entire file, I need to delete specific lines in the files with the bad data. – Guillermo Mirandes Feb 20 '18 at 19:06
  • 4
    Athena does not have that support as of now. Also i don't feel that it would fall into Athena's charter as it is just an analysis engine on data stored somewhere. – Dhaval Feb 22 '18 at 04:01
  • 1
    This works side-by-side with a python lambda and using python to query athena retrieve these locations and then, assuming files fit within a lambda 10GB limit, pandas to read csv/json/parquet files in and rewrite them back out to the same location minus the records to delete. If you could deal with single item record, high partitioning, set a soft delete column at the end and then updating that flag also works with a similar process also works – vfrank66 Jan 30 '22 at 06:34
7

I also would like to add that after you find the files to be updated you can filter the rows you want to delete, and create new files using CTAS: https://docs.aws.amazon.com/athena/latest/ug/ctas.html

Later you can replace the old files with the new ones created by CTAS. I think it is the most simple way to go

Ori N
  • 555
  • 10
  • 22
6

The answer is Yes, now you can delete the data from Athena, recently AWS has introduced ICEBERG table which supports the ACID property.

You need to create an iceberg table that will have the same data as your Athena table(where you want to delete/update records) using the below steps.

Create ICEBERG TABLE

    Create table new_iceberg_table 
    (id double, name string)
    LOCATION 'S3://path/where/you/want/to_save/'
    TBLPROPERTIES (table_type='iceberg')

Load data from your Data Catalogue into this new iceberg table.

   Insert into datasource.new_iceberg_table
   Select * from datasource.main_athena_table.

main_athena_table = Table where you want to perform Delete/Update or ACID. new_iceberg_table = Newly created table

Now you can insert, update, and delete the data from iceberg table.

You can also time travel using SYSTEM_TIME.

Relevant SQL's

#Update SQL =

UPDATE from datasource.new_iceberg_table set id = 04 where name='ABC' ;

#Delete SQL =

DELETE from datasource.new_iceberg_table where name='ABC' ;

#Time travel SQL (In case you want to time travel and want to see the older data)

SELECT * from datasource.new_iceberg_table for SYSTEM_TIME as of (current_timestamp - interval '10' minutes) where name='ABC'

Thank you.

4

I would just like to add to Dhaval's answer.

You can find out the path of the file with the rows that you want to delete and instead of deleting the entire file, you can just delete the rows from the S3 file which I am assuming would be in the Json format.

The process is to download the particular file which has those rows, remove the rows from that file and upload the same file to S3.

This just replaces the original file with the one with modified data (in your case, without the rows that got deleted). After the upload, Athena would tranform the data again and the deleted rows won't show up.

4

Use AWS Glue for that. Load your data, delete what you need to delete, save the data back.

Now you can also delete files from s3 and merge data: https://aws.amazon.com/about-aws/whats-new/2020/01/aws-glue-adds-new-transforms-apache-spark-applications-datasets-amazon-s3/

Stefan
  • 747
  • 8
  • 11
0

AWS has announced general availability of Iceberg integration with Athena and Athena now support DMLs at raw level for Iceberg tables.

UPDATE and DELETE rows can be done using SQLs:

DELETE FROM [db_name.]table_name [WHERE predicate]

UPDATE [db_name.]table_name SET xx=yy[,...] [WHERE predicate]

For more details - AWS DOCUMENT

Note that Athena Iceberg integration now is very restrictive - nested SQLs for Deletes and Updates are NOT supported. Below query won't work:

Delete from table1 where uniqueid in (select b.uniqueid from delete_staging b)

faisal_kk
  • 1,039
  • 1
  • 11
  • 19
0

Shivendra Singh's answer about ICEBERG should be accepted, as ICEBERG seems to answer all needs now. But if you need to stay on the Hive table, or if your files format is JSON and you need to keep it this way, you have the following option:

  1. Use CTAS to create new table with values you want to keep. If it's hard to phrase query this way, you can always do something like where id not in (select id ...) or select * from ... except select * from .... If your table is partitioned, and after deletion there should be more than 100 partitions, you'll need to use "insert into" technique to create up to 100 partitions per query (https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html).
  2. Move (just in case) original data from S3 for partitions that were relevant for the deletion
  3. Move data that was created by (1)
Kaplan Ilya
  • 479
  • 3
  • 12