I am evaluating AWS database services to pick the most effective one. The objective is to load data from a JSON file in an S3 bucket into Redshift every 5 minutes.

I am currently trying to use AWS Data Pipeline to automate the ETL. I have been following this AWS tutorial, "Copy Data to Amazon Redshift Using the AWS Data Pipeline Console", which is all simple and clear.

https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-copydata-redshift-create.html

I set up a cluster on Redshift and a bucket on S3, and created all the required roles and policies with the permissions needed.
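
For context, the pipeline definition produced by the console template is essentially a RedshiftCopyActivity running on an Ec2Resource on a schedule. A rough sketch of the kind of definition I am working with is below (object types and fields are from the Data Pipeline documentation; the IDs, bucket path, table and cluster details are placeholders rather than my exact values):

{
    "objects": [
        {
            "id": "DefaultSchedule",
            "type": "Schedule",
            "period": "5 minutes",
            "startAt": "FIRST_ACTIVATION_DATE_TIME"
        },
        {
            "id": "Ec2Instance",
            "type": "Ec2Resource",
            "role": "DataPipelineDefaultRole",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "terminateAfter": "1 Hour",
            "schedule": { "ref": "DefaultSchedule" }
        },
        {
            "id": "S3InputDataNode",
            "type": "S3DataNode",
            "directoryPath": "s3://my-input-bucket/data/",
            "schedule": { "ref": "DefaultSchedule" }
        },
        {
            "id": "RedshiftCluster",
            "type": "RedshiftDatabase",
            "clusterId": "redshift-cluster-1",
            "databaseName": "db",
            "username": "master_user",
            "*password": "..."
        },
        {
            "id": "RedshiftOutputDataNode",
            "type": "RedshiftDataNode",
            "database": { "ref": "RedshiftCluster" },
            "tableName": "my_table",
            "schedule": { "ref": "DefaultSchedule" }
        },
        {
            "id": "RedshiftLoadActivity",
            "type": "RedshiftCopyActivity",
            "runsOn": { "ref": "Ec2Instance" },
            "input": { "ref": "S3InputDataNode" },
            "output": { "ref": "RedshiftOutputDataNode" },
            "insertMode": "TRUNCATE",
            "schedule": { "ref": "DefaultSchedule" }
        }
    ]
}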

Now, after creating the pipeline and pressing Activate, a warning comes up saying:

Errors/Warnings:

Object:Ec2Instance
WARNING: Could not validate S3 Access for role. Please ensure role ('DataPipelineDefaultRole') has s3:Get*, s3:List*, s3:Put* and sts:AssumeRole permissions for DataPipeline.

Now, I am sure that my role and resource role have s3:Get*, s3:List*, s3:Put* and sts:AssumeRole.

In fact, they both have FullAccess to basically all the services I need.

DataPipelineDefaultRole policy:

{
"Version": "2012-10-17",
"Statement": [
    {
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": "iam:CreateServiceLinkedRole",
        "Resource": "*",
        "Condition": {
            "StringLike": {
                "iam:AWSServiceName": [
                    "elasticmapreduce.amazonaws.com",
                    "spot.amazonaws.com"
                ]
            }
        }
    },
    {
        "Sid": "VisualEditor1",
        "Effect": "Allow",
        "Action": [
            "ec2:AuthorizeSecurityGroupIngress",
            "sdb:Select*",
            "sqs:ReceiveMessage",
            "s3:Get*",
            "sqs:GetQueue*",
            "s3:CreateBucket",
            "sns:Unsubscribe",
            "s3:List*",
            "datapipeline:EvaluateExpression",
            "ec2:StartInstances",
            "dynamodb:DescribeTable",
            "sqs:Delete*",
            "iam:ListAttachedRolePolicies",
            "ec2:RevokeSecurityGroupEgress",
            "dynamodb:GetItem",
            "sns:Subscribe",
            "iam:ListRolePolicies",
            "s3:DeleteObject",
            "sdb:BatchPutAttributes",
            "iam:GetRole",
            "dynamodb:BatchGetItem",
            "redshift:DescribeClusterSecurityGroups",
            "ec2:CreateTags",
            "ec2:DeleteNetworkInterface",
            "ec2:RunInstances",
            "dynamodb:Scan",
            "rds:DescribeDBSecurityGroups",
            "ec2:StopInstances",
            "ec2:CreateNetworkInterface",
            "ec2:CancelSpotInstanceRequests",
            "cloudwatch:*",
            "sqs:PurgeQueue",
            "iam:GetRolePolicy",
            "dynamodb:UpdateTable",
            "ec2:RequestSpotInstances",
            "ec2:DeleteTags",
            "sns:ListTopics",
            "ec2:ModifyImageAttribute",
            "iam:PassRole",
            "sns:Publish",
            "ec2:DescribeNetworkInterfaces",
            "ec2:CreateSecurityGroup",
            "rds:DescribeDBInstances",
            "ec2:ModifyInstanceAttribute",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:DetachNetworkInterface",
            "ec2:TerminateInstances",
            "iam:GetInstanceProfile",
            "sns:GetTopicAttributes",
            "datapipeline:DescribeObjects",
            "dynamodb:Query",
            "iam:ListInstanceProfiles",
            "ec2:Describe*",
            "ec2:DeleteSecurityGroup",
            "redshift:DescribeClusters",
            "sqs:CreateQueue",
            "elasticmapreduce:*",
            "s3:Put*"
        ],
        "Resource": "*"
    },
    {
        "Sid": "VisualEditor2",
        "Effect": "Allow",
        "Action": [
            "iam:PassRole",
            "s3:Get*",
            "s3:List*",
            "s3:Put*",
            "sts:AssumeRole"
        ],
        "Resource": [
            "arn:aws:iam::*:role/DataPipelineDefaultResourceRole",
            "arn:aws:iam::*:role/DataPipelineDefaultRole",
            "arn:aws:s3:::*/*"
        ]
    },
    {
        "Sid": "VisualEditor3",
        "Effect": "Allow",
        "Action": [
            "s3:Get*",
            "s3:List*",
            "s3:Put*"
        ],
        "Resource": "arn:aws:s3:::*"
    },
    {
        "Sid": "VisualEditor4",
        "Effect": "Allow",
        "Action": [
            "s3:Get*",
            "s3:List*",
            "s3:Put*"
        ],
        "Resource": "*"
    }
]
}

DataPipelineDefaultResourceRole policy:

{
"Version": "2012-10-17",
"Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "cloudwatch:*",
            "datapipeline:*",
            "dynamodb:*",
            "ec2:Describe*",
            "elasticmapreduce:AddJobFlowSteps",
            "elasticmapreduce:Describe*",
            "elasticmapreduce:ListInstance*",
            "rds:Describe*",
            "redshift:DescribeClusters",
            "redshift:DescribeClusterSecurityGroups",
            "s3:*",
            "sdb:*",
            "sns:*",
            "sqs:*"
        ],
        "Resource": [
            "*"
        ]
    }
]
}

I have been looking into the issue for over a week and tried all the existing solutions: updated the trust relationship, recreated the roles, kept the default roles, let Data Pipeline create new ones, and checked the security group. I am still having the same issue.
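
For reference, the trust relationships should match what AWS documents for the default Data Pipeline roles; to the best of my knowledge mine look roughly like this sketch (service principals taken from the AWS defaults, not copied verbatim from my account):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "datapipeline.amazonaws.com",
                    "elasticmapreduce.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

and, for DataPipelineDefaultResourceRole (the EC2 instance profile), the trusted service is ec2.amazonaws.com:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}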

After activating the pipeline and checking the log URI, I do find 2 folders, Ec2Instance and RedshiftLoadActivity. The Redshift log file contains only 2 lines; the other has more [INFO] lines describing the download of the jar and the S3 file for the TaskRunner.

Among the logs there are [INFO] entries and these [WARN] entries:

Ec2Instance:

private.com.amazonaws.services.s3.internal.S3V4AuthErrorRetryStrategy: Attempting to re-send the request to mylogbucket.s3.eu-central-1.amazonaws.com with AWS V4 authentication. To avoid this warning in the future, please use region-specific endpoint to access buckets located in regions that require V4 signing.

RedshiftLoadActivity:

private.com.amazonaws.services.s3.internal.S3V4AuthErrorRetryStrategy: Attempting to re-send the request to mylogbucket.s3.eu-central-1.amazonaws.com with AWS V4 authentication. To avoid this warning in the future, please use region-specific endpoint to access buckets located in regions that require V4 signing.

The issue should be the roles and policies, yet I made sure Redshift and the S3 bucket were not the problem, since I tried running a COPY command in the Query Editor and it loaded the data as expected.
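
The manual COPY I ran in the Query Editor was along these lines (table name, bucket path and role ARN here are placeholders, not my exact values):

COPY my_table
FROM 's3://my-input-bucket/data/file.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole'
FORMAT AS JSON 'auto'
REGION 'eu-central-1';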

I am currently still stuck on that error and would love some suggestions on how I could possibly fix this problem.

From stderr.log:

548 [ERROR] (TaskRunnerService-resource:df-0539055_@Ec2Instance_2019-05-30T13:38:35-0) amazonaws.datapipeline.database.ConnectionFactory: Unable to establish connection to jdbc:postgresql://redshift-cluster-1.coykb9.eu-central-1.redshift.amazonaws.com:5439/db Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections

  • Please include your policy declarations. – Istvan May 29 '19 at 09:07
  • There, I added both policies. – shini May 29 '19 at 09:17
  • It looks like you are missing AssumeRole for EC2. – Istvan May 29 '19 at 09:27
  • I have to add "sts:AssumeRole" in the action of the DataPipelineDefaultRole, right? I did that and tried again but am still getting the same warning. – shini May 29 '19 at 09:35
  • No, I do not think so. I think you need to add it on the instance profile of the EC2 nodes, if I am not mistaken. The problem is I am not too familiar with the Data Pipeline service, but generally speaking the EC2 instance profile has sts:AssumeRole. – Istvan May 29 '19 at 10:13

1 Answer

Data Pipeline uses EMR to spawn EC2 instances and complete the submitted task.

Check the EC2 instance profile and the EMR role of the EMR cluster spawned by Data Pipeline. Attach an S3 access policy to the EC2 instance profile role of the EMR cluster.

By default, the EC2 instance profile is DataPipelineDefaultResourceRole.
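
A minimal S3 access policy to attach to that instance profile role could look like this (a sketch; ideally scope the Resource down to the specific buckets instead of *):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*"
            ],
            "Resource": "*"
        }
    ]
}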

To fix the error

Unable to establish connection to jdbc:postgresql://redshift-cluster-1.coykb9.eu-central-1.redshift.amazonaws.com:5439/db Connection refused

update your Redshift security group's inbound rules to allow connections from 0.0.0.0/0. This means any machine on the internet is allowed to connect using the credentials.
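
The inbound rule would look something like this (5439 is the default Redshift port, which matches the endpoint in the error above):

Type: Custom TCP    Protocol: TCP    Port range: 5439    Source: 0.0.0.0/0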

singh30
  • I don't have any instance in EC2 and no cluster in EMR either; I didn't need to create any. But I did add S3 and Data Pipeline FullAccess to both these roles anyway in case one is somehow being used, tried again, and am still getting the same warning. When I check the execution of the pipeline, the S3InputDataNode's status is `FINISHED` while the status for the Ec2Instance is `RUNNING`; it makes 3 attempts and fails every one of them. Once it's done, the whole pipeline's status becomes `FAILED`. – shini May 29 '19 at 12:52
  • You can check the stderr.log for the exact reason of the failure. – singh30 May 29 '19 at 16:45
  • I did, and the only error I get is `548 [ERROR] (TaskRunnerService-resource:df-0539055_@Ec2Instance_2019-05-30T13:38:35-0) amazonaws.datapipeline.database.ConnectionFactory: Unable to establish connection to jdbc:postgresql://redshift-cluster-1.coykb9.eu-central-1.redshift.amazonaws.com:5439/db Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.` My security group has 0.0.0.0/0 as the outbound destination and accepts all traffic. – shini May 31 '19 at 07:49
  • Update your redshift security group Inbound rules to allow connection from 0.0.0.0/0 – singh30 May 31 '19 at 07:55
  • Thanks, that actually fixed the JDBC connection issue. It all works now and I can load data from S3 to Redshift, but for some reason the roles warning is still showing. – shini May 31 '19 at 13:59
  • Can you share the SQL command used in Redshift to copy the data from S3? – Tewfik Ghariani Jun 01 '19 at 12:45
  • This AWS doc: https://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html explains how to compose the COPY command to copy data from S3 to Redshift. – singh30 Jun 01 '19 at 15:58
  • @TewfikGhariani I am not using an SQL command; I am using Data Pipeline, which uses the default COPY command. But the issue I was having and the warning I am still getting have nothing to do with the command itself; both are caused by the roles/policies and the security group. – shini Jun 01 '19 at 19:23
  • Exactly that's why I'm asking. In the documentation it was mentioned that you need to specify the IAM role so I thought that might have been misconfigured https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-learn-copy-redshift.html – Tewfik Ghariani Jun 02 '19 at 15:37
  • You use that COPY command in the Redshift editor when you want to do the ETL from S3 to Redshift manually, but Data Pipeline generates the COPY command with the roles automatically, and the IAM roles configuration was also created by default when creating the pipeline (I posted the policies in use above). – shini Jun 03 '19 at 07:33