
I am trying to copy a bunch of CSV files from S3 to Redshift using the RedshiftCopyActivity in an AWS Data Pipeline.

This works fine as long as the CSV structure matches the table structure. In my case the CSV has fewer columns than the table, and the RedshiftCopyActivity then fails with a "Delimiter not found" error in stl_load_errors.

I would like to use the Redshift COPY command's column list option. That way I could make it work, but the column list part of the COPY command does not seem to be available in the RedshiftCopyActivity.
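
For illustration, this is roughly the statement I would like the activity to issue; the table, column, and bucket names below are made up:

```sql
-- Load only the two columns the CSV files actually contain;
-- the remaining table columns would be left to their defaults.
COPY my_table (col_a, col_b)
FROM 's3://my-bucket/input/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER ',';
```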

Does anyone have any suggestions?

All tips warmly welcomed.

Many thanks upfront.

Peter

  • Did you ever find a solution to this problem? – Erve1879 Feb 19 '15 at 10:03
  • No solution. We were able to avoid the issue by developing a Python script that is started by the pipeline and invokes the Redshift COPY command. Less elegant than I'd like, but at least it works. – Peter Feb 19 '15 at 14:41
  • Thanks @Peter. As it happens, I've just written a Python script to replace the whole Data Pipeline. Had endless issues with the RedshiftCopyActivity and the Data Pipeline console. – Erve1879 Feb 19 '15 at 15:09
  • The `transformSql` option for the [RedshiftCopyActivity](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html) states the following: *"The SQL SELECT expression used to transform the input data."* Further, it says that data is loaded into a table named `staging`, that your transforming `SELECT` statement should be run against it, and that the output schema of `transformSql` must match the final target table's schema. Perhaps this is the option you were looking for? – Adrian Torrie Jun 06 '15 at 12:31

1 Answer


I know this is an old question, but you can now specify a list of columns in the Redshift COPY command.

```sql
COPY tablename (column1 [, column2, ...])
```

When loading data from S3, the order of the columns in the list needs to match the order of the fields in the source files. Check out the docs here: Amazon Redshift Column Mapping Options.
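
As a rough sketch, assuming a hypothetical orders table that has a load_time column the CSV files do not supply:

```sql
-- load_time is omitted from the column list, so COPY falls back to the
-- column's DEFAULT expression (or NULL if no default is defined)
COPY orders (order_id, customer_id, order_date)
FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
IGNOREHEADER 1;
```

Columns left out of the list are loaded with their DEFAULT expression, or NULL when no default is defined; if such a column is declared NOT NULL without a default, the COPY fails.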

Radu

  • Thanks for letting me know, Radu. We gave up on Data Pipeline and wrote a Python script. For now it is running on an EC2 instance, but I guess we can consider moving to Lambda. – Peter Apr 04 '16 at 20:44
  • There is a really good blog post on Amazon Redshift and Lambda. We already use this and it works like a charm. You should check it out: [A Zero-Administration Amazon Redshift Database Loader](https://blogs.aws.amazon.com/bigdata/post/Tx24VJ6XF1JVJAA/A-Zero-Administration-Amazon-Redshift-Database-Loader) – Radu-Stefan Zugravu Apr 05 '16 at 06:59