6

I read all the answered question over here, 5 of them. And all of them are out of date about available AWS products right now.

So, as a newbie at AWS, I would like to know how to solve my problem or what is the best approach to solve my problem using only and only AWS solutions. I would like to avoid any third party. I know I'm going to cite one of the approaches I googled but just to refer it.

Anyway, I have a goal to achieve and this is basically replacing my SQL Server 2012 Integrations Services for something using only AWS products. Right now I'm accessing an FTP server and downloading a bunch of CSV files to my drive, reading them, transforming them into my datasets and loading them into my specified tables. This process is scheduled to execute 3 times every single day.

My initial proposal was to upload files to S3, use AWS Glue Crawlers to crawl my files and fill my self-created AWS Glue Data Catalogs, them ETL to my RDS. So far I could achieve my Lambda Function to connect my FTP and upload to my S3, also I could retrieve my data using AWS Athena, just to see if all things were working fine.

But now, I'm struggling to make my ETL copy/create my table into RDS and write the data. I created My Glue Connection under the same RDS VPC, subnet and security group, also my security group has All TCP from anywhere inbound (I know, I'm not leaving this, it is just for tests) and I'm using JDBC, writing the following JDBC URL:

jdbc:sqlserver://my-database-name.xsdfxsdsfsfsx.us-east-1.rds.amazonaws.com:1433;databaseName=my-database-name

I could test my created connection using "Test Connection" inside AWS Glue, and it worked fine. But after creating my Job using the Job tutorial and running it, inside my log errors I can see this:

com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 167.254.77.1:8088 [/167.254.77.1] failed: Connection refused (Connection refused)

I tried to create a connection using Amazon RDS option, but on the second screen after picking the instance I am receiving the following error:

Unable to find a suitable security group. Change connection type to JDBC and retry adding your connection.

I checked my IAM and I do have the AWSGlueServiceRoleDefault role within AWS service: glue trusted service and AWSGlueServiceRole for AWS managed policy as scripted in the documentation.

I would like to know what I'm missing or how to fix it to make it work. Or even if there is any better approach to achieve my goal.

Andrew Paes
  • 1,940
  • 1
  • 15
  • 20
  • Have you followed all the steps on the [docs](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html). Note the requirement for self-referencing group as well. – Zaheer Oct 18 '17 at 17:37
  • Yes, I did. Also I'm having AWS Support. I sahred my screen with them and they worked on ti but no success untill now. They are going to investigate what could be the problem. Meanwhile I'm reading again all the docs, foruns, and doing again all my environment. Also creating my py script on my server trying to use their SDK to debug and trying to find what could I'm missing. – Andrew Paes Oct 18 '17 at 17:56

1 Answers1

15

Well, fist of all I had 2 problems to solve. Im going to show how I did solve both of them.

1 - Create a connection using the "Amazon RDS" option;

My JDBC connection was working fine because I was explicit specifying which security group I wanted to be used, I called it "sg-glue", and this "sg-glue" security group was allowed in my inbound database security group, also my "sg-glue" had "All TCP" from anywhere allowed for inbound.

So when I was trying to create my connection using "Amazon RDS" I didn't notice in time this is just an easy way to create the very same JDBC connection, but you don't have the choice to specify which security group you want to use. On this way is applied the very same database security group to this connection, and this was the reason I was seeing that security group error, as my database security group didn't have "All TCP" allowed.

Reading back the documentation I could see what I was doing wrong, in fact I was trying to set a midway security group to protect my database. But as documentaion says I need to have to give "All TCP" permission right into my databse security group, so I did reset it doing the following steps:

  1. In the Amazon RDS left navigation pane, choose Instances.
  2. Choose the Amazon RDS Engine and DB Instance name that you want to access from AWS Glue.
  3. From Instance Actions, choose See Details. On the Details tab, find the Security Groups name you will access from AWS Glue. Record the name of the security group for future reference.
  4. Choose the security group to open the Amazon EC2 console.
  5. Confirm that your Group ID from Amazon RDS is chosen, then choose the Inbound tab.
  6. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID. RDS Inbound

  7. Add a rule to for outbound traffic also. Either open outbound traffic to all ports or create a self-referencing rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID. enter image description here

After all these steps, both connection were working fine.

2 - Use my created connection to allow me to ETL my data to my RDS;

So having my connection working successfully, I tried to make my job connect into my RDS and create my table and write on ti. But I couldn't, I was receiving an error:

Py4JJavaError: An error occurred while calling o74.pyWriteDynamicFrame. java.lang.NullPointerException

And I didn't know why. I tried everything, than I called AWS Support. For my best luck I had such a HERO as support, called Li H. this girl worked a lot and went every single corner to find what was happening, asked every single person at Amazon and after 4 days working, meanwhile we shared my screens, created so many times new environments to find the cause of this problem, since VPC, SG, DNS, and when we were hopeless she had a bulb light over her head and asked to change my database name to "testing"...

So I created a new connection. Created a new Job using this new connection. And it worked.

Conclusion, you can't have the same name for database and instance. They need to be different names.

Public tnx special to this girl, Li H. And very nice to have such a good professional support. Also:

  • username used on the connection need to have "create table" permissions.

  • If your job writes to a Microsoft SQL Server table, and the table has columns defined as type Boolean, then the table must be predefined in the SQL Server database.

  • Your IAM role needs to have the AWS Glue policy attached on it

Spencer Sutton
  • 2,907
  • 2
  • 18
  • 19
Andrew Paes
  • 1,940
  • 1
  • 15
  • 20
  • Well appreciated for your time in detailed problem solving details, and good write-up. I am working on a related migration, and this would be good reference. Do you have any approach for UPSERTs using pyspark to RDS? Am looking for some good solutions, unfortunately there is none in the google (surprisingly). Appreciate if you can provide some pseudo-code / approaches, if you are working on pyspark. Thanks – Yuva Jun 09 '18 at 17:08
  • Hino, Im sorry tô bem late, Just saw it now. I didnt do mundo upsert exactly using py, i solved this cretating a lambda function to called from my glue job at the end. This lamba function was calling my procedure and doing my upsert. Nevertheless i think to do your upsert you could use the same approach. I canal post this, but i think it is too long to post here. Só ask a question and post the link here. Problably monday att morning I answer it. Time to be daddy right now. – Andrew Paes Jun 10 '18 at 18:29
  • Thanks Andrew, I have already posted a question related to this, and here is the link. Appreciate if you can provide some light. Link: https://stackoverflow.com/questions/50613173/pyspark-update-certain-columns-in-mysql-table – Yuva Jun 11 '18 at 00:54
  • Reason: Could not find S3 endpoint or NAT gateway for subnetId .i am getting this error even after setting self-referencing .any other reason for same – Vish Jul 19 '18 at 15:50
  • 1
    @Vish You need to creat a security group for Glue and you need to have an Amazon S3 VPC endpoint https://docs.aws.amazon.com/glue/latest/dg/vpc-endpoints-s3.html – Andrew Paes Jul 19 '18 at 16:50
  • @AndrewPaes I am trying to connec to rds by jdbc .do i still vpc endpoint? – Vish Jul 19 '18 at 20:13
  • Glue needs a Amazon S3 endpoint to ETL the data. You can see this when you are setting a job. – Andrew Paes Jul 19 '18 at 20:24
  • Thanks a lot! It works fine! What actually means the self reference rule? – Fucio Feb 12 '21 at 11:54