1

Any way to have the query still continue running even if the original calling client has shut down?

I have an ETL server with 64 cores.

I want to run a COPY command after I process many files per day.

COPY takes a really long time, and that ETL server should only exist while it has files to process; it's a waste of money to keep it running just waiting on SQL COPY commands.

I could send a ready status to SQS and have it be picked up by a nano server, which can wait on SQL commands to finish all day without worry.

But it would probably be better if I could just submit the SQL to the Redshift cluster and have it work on it asynchronously.

Commands like psql -c "query..." will block until the query is finished. If the psql process gets interrupted, it will cancel and roll back the query. Is there a way to send an async query that does not rely on the client staying online for the query to complete?

  • It's not possible - that's how JDBC/ODBC work. This answer has a good explanation: https://stackoverflow.com/a/7616094/1680826 – botchniaque Nov 11 '19 at 22:26
  • I got you, but the Redshift instances are EC2 machines, aren't they? I was hoping there was some way to queue up commands without being tied to the client machine staying up. Async threads and that kind of thing still rely on the client process living. – Potato Baggins Nov 11 '19 at 23:56
  • Amazon Redshift is based on a fork of PostgreSQL 8.0.x. Therefore, it will behave much like PostgreSQL. The virtual hardware does not have an impact. – John Rotenstein Nov 12 '19 at 01:24
  • Like most databases, Redshift runs SQL commands in a session opened by a client - that session needs to persist while the command runs, certainly for most DDL and DML, and I'm fairly sure that applies to COPY as well. So your suggestion of running a small instance to execute the SQL from is a reasonable one in my view if you don't want to use your main ETL server. – Nathan Griffiths Nov 12 '19 at 02:59
  • Well, you can get around this with Lambda. You can simply invoke the lambda function async from your ETL server. Set the lambda timeout to something like 5-10 secs and see the magic happening: COPY, UNLOAD and VACUUM statements won't stop after lambda times out. – arthurq Mar 01 '23 at 19:09

2 Answers

0

Yes, look at EXTERNAL FUNCTION in Redshift. https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_FUNCTION.html

It's a Lambda UDF that you call synchronously from Redshift, while the actual ingestion of your files is kicked off asynchronously.

CREATE EXTERNAL FUNCTION ingest_files()
RETURNS VARCHAR
STABLE
LAMBDA 'ingest_files'
IAM_ROLE 'arn:aws:iam::123456789012:role/Redshift-Ingest-Test';

ingest_files is synchronous: it identifies the list of files to ingest and passes it to another Lambda function that is invoked asynchronously. You can execute it like this:

select ingest_files();

ingest_files 
------------
 10 files submitted.
(1 row)

Now, inside ingest_files, you kick off another Lambda function, invoke_ingest_files:

# ingest_files.py
import json
import boto3

lambda_client = boto3.client('lambda')
...
payload = [..file names...]
# 'Event' invocation is fire-and-forget: the call returns immediately,
# so the synchronous UDF can return while the ingest Lambda keeps running.
response = lambda_client.invoke(
    InvocationType='Event',
    FunctionName='invoke_ingest_files',
    Payload=json.dumps(payload)
)
...
Alex B
  • You can simply invoke the lambda function async from your ETL server. No need to create the UDF on Redshift for this case. Set the lambda timeout to something like 5-10 secs and see the magic happening: COPY, UNLOAD and VACUUM statements won't stop after lambda times out. – arthurq Mar 01 '23 at 19:07
0

Look into Redshift Data API as it might meet your needs - https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html

You submit the query to the Data API, it runs the query on the cluster, and you can poll for the result as you wish. It may not work with your solution, but it's an option to consider.
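
For illustration, here is a minimal sketch of that pattern using boto3's redshift-data client; the cluster identifier, database, secret ARN, table, and S3 path below are placeholders, not from the original answer:

# data_api_copy.py - hypothetical sketch, all identifiers are placeholders
import boto3

client = boto3.client('redshift-data')

# execute_statement returns immediately; the statement keeps running on the
# cluster even after this client process exits.
resp = client.execute_statement(
    ClusterIdentifier='my-cluster',
    Database='dev',
    SecretArn='arn:aws:secretsmanager:...',  # credentials for the cluster
    Sql="COPY my_table FROM 's3://my-bucket/prefix/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role' CSV;",
)

# Later, from any machine, poll the status by statement Id if you want to.
status = client.describe_statement(Id=resp['Id'])['Status']
print(status)  # e.g. SUBMITTED, STARTED, FINISHED, FAILED

Because the statement is tracked server-side by its Id, the ETL box can shut down as soon as execute_statement returns.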

Bill Weiner